Abstract

We consider the problem of high-dimensional Gaussian graphical model selection. We
identify a set of graphs for which an efficient estimation algorithm exists, and this algorithm
is based on thresholding of empirical conditional covariances. Under a set of transparent
conditions, we establish structural consistency (or sparsistency) for the proposed algorithm,
when the number of samples $n = \Omega(J_{\min}^{-2} \log p)$, where $p$ is the number of variables and
$J_{\min}$ is the minimum (absolute) edge potential of the graphical model. The sufficient
conditions for sparsistency are based on the notion of walk-summability of the model and
the presence of sparse local vertex separators in the underlying graph. We also derive novel
non-asymptotic necessary conditions on the number of samples required for sparsistency.

Keywords: Gaussian graphical model selection, high-dimensional learning, local-separation
property, walk-summability, necessary conditions for model selection.

1. Introduction 

Probabilistic graphical models offer a powerful formalism for representing high-dimensional
distributions succinctly. In an undirected graphical model, the conditional independence
relationships among the variables are represented in the form of an undirected graph.
Learning a graphical model from observed samples is an important task, and involves both
structure and parameter estimation. While there are many techniques for parameter
estimation (e.g., expectation maximization), structure estimation is arguably more challenging.
High-dimensional structure estimation is NP-hard for general models (Karger and Srebro,
2001; Bogdanov et al., 2008) and moreover, the number of samples available for learning is
typically much smaller than the number of dimensions (or variables).

The complexity of structure estimation depends crucially on the underlying graph
structure. Chow and Liu (1968) established that structure estimation in tree models reduces to
a maximum weight spanning tree problem and is thus computationally efficient. However, a
general characterization of graph families for which structure estimation is tractable has so
far been lacking. In this paper, we present such a characterization based on the so-called
local separation property in graphs. It turns out that a wide variety of (random) graphs satisfy
this property (with probability tending to one) including large girth graphs, the Erdos-Renyi
random graphs (Bollobás, 1985) and the power-law graphs (Chung and Lu, 2006), as well
as graphs with short cycles such as the small-world graphs (Watts and Strogatz, 1998) and
other hybrid/augmented graphs (Chung and Lu, 2006, Ch. 12).

Successful structure estimation also relies on certain assumptions on the parameters of
the model, and these assumptions are tied to the specific algorithm employed. For instance,
for convex-relaxation approaches (Meinshausen and Bühlmann, 2006; Ravikumar et al., 2008),
the assumptions are based on certain incoherence conditions on the model, which are hard
to interpret as well as verify in general. In this paper, we present a set of transparent
conditions for Gaussian graphical model selection based on walk-sum analysis (Malioutov et al.,
2006). Walk-sum analysis has been previously employed to analyze the performance of loopy
belief propagation (LBP) and its variants in Gaussian graphical models. In this paper, we
demonstrate that walk-summability also turns out to be a natural criterion for efficient
structure estimation, thereby reinforcing its importance in characterizing the tractability of
Gaussian graphical models.



1.1 Summary of Results 

Our main contributions in this work are threefold. First, we propose a simple local algorithm
for Gaussian graphical model selection, termed the conditional covariance threshold test (CCT),
based on a set of conditional covariance thresholding tests. Second, we derive sample
complexity results for our algorithm to achieve structural consistency (or sparsistency).
Third, we prove a novel non-asymptotic lower bound on the sample complexity required by
any learning algorithm to succeed. We now elaborate on these contributions.

Our structure learning procedure is known as the Conditional Covariance Test (CCT; see
footnote 1) and is outlined in Algorithm 1. Let $\mathrm{CCT}(x^n; \xi_{n,p}, \eta)$ be the output edge set from CCT given
$n$ i.i.d. samples $x^n$, a threshold $\xi_{n,p}$ (that depends on both $p$ and $n$) and a constant $\eta \in \mathbb{N}$,
which is related to the local vertex separation property (described later). The conditional
covariance test proceeds in the following manner. First, the empirical absolute conditional
covariances (see footnote 2) are computed as follows:

$$\widehat{\Sigma}(i,j|S) := \widehat{\Sigma}(i,j) - \widehat{\Sigma}(i,S)\,\widehat{\Sigma}^{-1}(S,S)\,\widehat{\Sigma}(S,j),$$



1. An analogous test is employed for Ising model selection in (Anandkumar et al., 2011a) based on
conditional mutual information. We later note that the conditional mutual information test has slightly worse
sample complexity for learning Gaussian models.

2. Alternatively, conditional independence can be tested via sample partial correlations, which can be
computed via regression or recursion. See (Kalisch and Bühlmann, 2007) for details.
Algorithm 1 Algorithm $\mathrm{CCT}(x^n; \xi_{n,p}, \eta)$ for structure learning using samples $x^n$.

Initialize $\widehat{G}_p^n = (V, \emptyset)$.
For each $i, j \in V$, if

$$\min_{S \subset V \setminus \{i,j\},\; |S| \le \eta} |\widehat{\Sigma}(i,j|S)| > \xi_{n,p}, \qquad (1)$$

then add $(i,j)$ to $\widehat{G}_p^n$.
Output: $\widehat{G}_p^n$.

where $\widehat{\Sigma}(\cdot,\cdot)$ are the respective empirical covariances. Note that $\widehat{\Sigma}^{-1}(S,S)$ exists when the
number of samples satisfies $n > |S|$ (which is the regime under consideration). The
conditional covariance is thus computed for each node pair $(i,j) \in V^2$ and the conditioning
set which achieves the minimum is found, over all subsets of cardinality at most $\eta$; if the
minimum value exceeds the threshold $\xi_{n,p}$, then the node pair is declared as an edge. See
Algorithm 1 for details.

The computational complexity of the algorithm is $O(p^{\eta+2})$, which is efficient for small
$\eta$. For the so-called walk-summable Gaussian graphical models, the parameter $\eta$ can be
interpreted as an upper bound on the size of local vertex separators in the underlying
graph. Many graph families have small $\eta$ and as such, are amenable to computationally
efficient structure estimation by our algorithm. These include Erdos-Renyi random graphs,
power-law graphs and small-world graphs, as discussed previously.
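
The thresholding test of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `solve`, `conditional_covariance` and `cct` are hypothetical, the covariance matrix is passed in as a list of lists (in practice it would be the empirical covariance of the samples), and the linear solve uses plain Gaussian elimination to stay dependency-free.

```python
from itertools import combinations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[r]) + [b[r]] for r in range(n)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def conditional_covariance(cov, i, j, S):
    """Sigma(i,j|S) = Sigma(i,j) - Sigma(i,S) Sigma(S,S)^{-1} Sigma(S,j)."""
    S = list(S)
    if not S:
        return cov[i][j]
    A = [[cov[a][b] for b in S] for a in S]
    w = solve(A, [cov[s][j] for s in S])  # w = Sigma(S,S)^{-1} Sigma(S,j)
    return cov[i][j] - sum(cov[i][s] * ws for s, ws in zip(S, w))


def cct(cov, xi, eta):
    """Return the set of pairs (i, j), i < j, that pass the threshold test.

    cov: covariance matrix (assumed symmetric positive definite)
    xi:  threshold on the minimum absolute conditional covariance
    eta: maximum size of the conditioning set
    """
    p = len(cov)
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        smallest = min(
            abs(conditional_covariance(cov, i, j, S))
            for size in range(eta + 1)
            for S in combinations(others, size)
        )
        if smallest > xi:
            edges.add((i, j))
    return edges
```

The exhaustive search over conditioning sets is what makes the cost polynomial in p only for fixed eta, so the sketch is practical only when eta is small.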

We establish that the proposed algorithm has a sample complexity of $n = \Omega(J_{\min}^{-2} \log p)$,
where $p$ is the number of nodes (variables) and $J_{\min}$ is the minimum (absolute) edge potential
in the model. As expected, the sample complexity improves when $J_{\min}$ is large, i.e., the
model has strong edge potentials. However, as we shall see, $J_{\min}$ cannot be arbitrarily large
for the model to be walk-summable. We derive the minimum sample complexity for various
graph families and show that this minimum is attained when $J_{\min}$ takes the maximum
possible value.

We also develop novel techniques to obtain necessary conditions for consistent structure
estimation of Erdos-Renyi random graphs and other ensembles with non-uniform
distribution of graphs. We obtain non-asymptotic bounds on the number of samples $n$ in terms
of the expected degree and the number of nodes of the model. The techniques employed
are information-theoretic in nature (Cover and Thomas, 2006). We cast the learning
problem as a source-coding problem and develop necessary conditions which combine the use of
Fano's inequality with the so-called asymptotic equipartition property.

Our sufficient conditions for structural consistency are based on walk-summability. This 
characterization is novel to the best of our knowledge. Previously, walk-summable models 
have been extensively studied in the context of inference in Gaussian graphical models. As 
a by-product of our analysis, we also establish the correctness of loopy belief propagation for
walk-summable Gaussian graphical models Markov on locally tree-like graphs (see Section 5
for details). This suggests that walk-summability is a fundamental criterion for tractable
learning and inference in Gaussian graphical models. 
1.2 Related Work

Given that structure learning of general graphical models is NP-hard (Karger and Srebro,
2001; Bogdanov et al., 2008), the focus has been on characterizing classes of models on
which learning is tractable. The seminal work of Chow and Liu (1968) provided an efficient
implementation of maximum-likelihood structure estimation for tree models via a maximum
weighted spanning tree algorithm. Error-exponent analysis of the Chow-Liu algorithm
was studied (Tan et al., 2010) and extensions to general forest models were considered
by Tan et al. (2011) and Liu et al. (2011). Learning trees with latent (hidden) variables
(Choi et al., 2011) has also been studied recently.

For graphical models Markov on general graphs, alternative approaches are required
for structure estimation. A recent paradigm for structure estimation is based on convex
relaxation, where an estimate is obtained via convex optimization which incorporates an
$\ell_1$-based penalty term to encourage sparsity. For Gaussian graphical models, such
approaches have been considered in Meinshausen and Bühlmann (2006); Ravikumar et al.
(2008); d'Aspremont et al. (2008), and the sample complexity of the proposed algorithms
has been analyzed. A major disadvantage in using convex-relaxation methods is that the
incoherence conditions required for consistent estimation are hard to interpret and it is not
straightforward to characterize the class of models satisfying these conditions.

An alternative to the convex-relaxation approach is the use of simple greedy local
algorithms for structure learning. The conditions required for consistent estimation are
typically more transparent, albeit somewhat restrictive. Bresler et al. (2008) propose an
algorithm for structure learning of general graphical models Markov on bounded-degree graphs,
based on a series of conditional-independence tests. Abbeel et al. (2006) propose an
algorithm, similar in spirit, for learning factor graphs with bounded degree. Spirtes and Meek
(1995), Cheng et al. (2002), Kalisch and Bühlmann (2007) and Xie and Geng (2008)
propose conditional-independence tests for learning Bayesian networks on directed acyclic
graphs (DAG). Netrapalli et al. (2010) proposed a faster greedy algorithm, based on
conditional entropy, for graphs with large girth and bounded degree. However, all the works
(Bresler et al., 2008; Abbeel et al., 2006; Spirtes and Meek, 1995; Cheng et al., 2002;
Netrapalli et al., 2010) require the maximum degree in the graph to be bounded
($\Delta = O(1)$), which is restrictive.
We allow for graphs where the maximum degree can grow with the number of nodes.
Moreover, we establish a natural tradeoff between the maximum degree and other parameters of
the graph (e.g., girth) required for consistent structure estimation.

Necessary conditions for consistent graphical model selection provide a lower bound on
sample complexity and have been explored before by Santhanam and Wainwright (2008);
Wang et al. (2010). These works consider graphs drawn uniformly from the class of bounded
degree graphs and establish that $n = \Omega(\Delta^k \log p)$ samples are required for consistent
structure estimation in a $p$-node graph with maximum degree $\Delta$, where $k$ is typically a
small positive integer. However, a direct application of these methods yields poor lower
bounds if the ensemble of graphs has a highly non-uniform distribution. This is the
case with the ensemble of Erdos-Renyi random graphs (Bollobás, 1985). Necessary
conditions for structure estimation of Erdos-Renyi random graphs were derived for Ising models
by Anandkumar et al. (2010) based on an information-theoretic covering argument.
However, this approach is not directly applicable to the Gaussian setting. We present a novel
approach for obtaining necessary conditions for Gaussian graphical model selection based
on the notion of typicality. We characterize the set of typical graphs for the Erdos-Renyi
ensemble, derive a modified form of Fano's inequality, and obtain a non-asymptotic lower
bound on sample complexity involving the average degree and the number of nodes.

We also briefly point to a large body of work on high-dimensional covariance
selection under different notions of sparsity. Note that the assumption of a Gaussian
graphical model Markov on a sparse graph is one such formulation. Other notions of sparsity
include Gaussian models with sparse covariance matrices, or having a banded Cholesky
factorization. Also, note that many works consider covariance estimation instead of
selection and in general, estimation guarantees can be obtained under less stringent conditions.
See Lam and Fan (2009), Rothman et al. (2008), Huang et al. (2006) and Bickel and Levina
(2008) for details.



Paper Outline The paper is organized as follows. We introduce the system model in
Section 2. We prove the main result of our paper regarding the structural consistency of
the conditional covariance thresholding test in Section 3. We prove necessary conditions for
model selection in Section 4. In Section 5, we analyze the performance of loopy belief
propagation in Gaussian graphical models. Section 6 concludes the paper. Proofs and
additional discussion are provided in the appendix.

2. Preliminaries and System Model 
2.1 Gaussian Graphical Models 

A Gaussian graphical model is a family of jointly Gaussian distributions which factor in
accordance with a given graph. Given a graph $G = (V, E)$, with $V = \{1, \ldots, p\}$, consider
a vector of Gaussian random variables $X = [X_1, X_2, \ldots, X_p]^T$, where each node $i \in V$ is
associated with a scalar Gaussian random variable $X_i$. A Gaussian graphical model Markov
on $G$ has a probability density function (pdf) that may be parameterized as

$$f_X(x) \propto \exp\left[-\frac{1}{2} x^T J_G x + h^T x\right], \qquad (2)$$



where $J_G$ is a positive-definite symmetric matrix whose sparsity pattern corresponds to that
of the graph $G$. More precisely,

$$J_G(i,j) = 0 \iff (i,j) \notin G. \qquad (3)$$

The matrix $J_G$ is known as the potential or information matrix, the non-zero entries $J(i,j)$
as the edge potentials, and the vector $h$ as the potential vector. A model is said to be
attractive if $J_{i,j} \le 0$ for all $i \ne j$. The form of parameterization in (2) is known as the
information form and is related to the standard mean-covariance parameterization of the
Gaussian distribution as

$$\mu = J_G^{-1} h, \qquad \Sigma = J_G^{-1},$$

where $\mu := E[X]$ is the mean vector and $\Sigma := E[(X - \mu)(X - \mu)^T]$ is the covariance matrix.
We say that a jointly Gaussian random vector $X$ with joint pdf $f(x)$ satisfies the local
Markov property with respect to a graph $G$ if

$$f(x_i | x_{\mathcal{N}(i)}) = f(x_i | x_{V \setminus i}) \qquad (4)$$

holds for all nodes $i \in V$, where $\mathcal{N}(i)$ denotes the set of neighbors of node $i \in V$ and $V \setminus i$
denotes the set of all nodes excluding $i$. More generally, we say that $X$ satisfies the global
Markov property if, for all disjoint sets $A, B \subset V$, we have

$$f(x_A, x_B | x_S) = f(x_A | x_S) f(x_B | x_S), \qquad (5)$$

where the set $S$ is a separator (see footnote 3) of $A$ and $B$. The local and global Markov properties are
equivalent for non-degenerate Gaussian distributions (Lauritzen, 1996).



Our results on structure learning depend on the precision matrix $J$. Let

$$J_{\min} := \min_{(i,j) \in G} |J(i,j)|, \qquad J_{\max} := \max_{(i,j) \in G} |J(i,j)|, \qquad D_{\min} := \min_i J(i,i). \qquad (6)$$

Intuitively, models with edge potentials which are "too small" or "too large" are harder
to learn than those with comparable potentials. Since we consider the high-dimensional
case where the number of variables $p$ grows, we allow the bounds $J_{\min}$, $J_{\max}$, and $D_{\min}$ to
potentially scale with $p$.

The partial correlation coefficient between variables $X_i$ and $X_j$, for $i \ne j$, measures their
conditional covariance given all other variables. These are computed by normalizing the
off-diagonal values of the information matrix, i.e.,

$$R(i,j) := -\frac{J(i,j)}{\sqrt{J(i,i)\,J(j,j)}}. \qquad (7)$$

For all $i \in V$, set $R(i,i) = 0$. We henceforth refer to $R$ as the partial correlation matrix.

An important sub-class of Gaussian graphical models of the form in (2) are the
walk-summable models (Malioutov et al., 2006). A Gaussian model is said to be $\alpha$-walk summable
if

$$\|\overline{R}\| \le \alpha < 1, \qquad (8)$$

where $\overline{R} := [|R(i,j)|]$ denotes the entry-wise absolute value of the partial correlation matrix
$R$ and $\|\cdot\|$ denotes the spectral or 2-norm of the matrix, which for symmetric matrices is
given by the maximum absolute eigenvalue.

In other words, walk-summability means that an attractive model formed by taking 
the absolute values of the partial correlation matrix of the Gaussian graphical model is also 
valid (i.e., the corresponding potential matrix is positive definite). This immediately implies 
that attractive models form a sub-class of walk-summable models. For detailed discussion
on walk-summability, see Section A.1.
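
As a rough numerical check of the walk-summability condition in (8), one can form the entry-wise absolute partial correlation matrix from a given potential matrix and estimate its spectral norm. The sketch below is illustrative only: the helper names are hypothetical, the input is assumed to be a symmetric positive-definite matrix given as a list of lists, and the norm is estimated by power iteration on the shifted matrix (the absolute partial correlation matrix plus the identity), which is a valid shortcut here because that matrix is symmetric and entry-wise non-negative.

```python
import math


def partial_correlation_abs(J):
    """Entry-wise absolute partial correlations |J(i,j)| / sqrt(J(i,i) J(j,j)), zero diagonal."""
    p = len(J)
    return [
        [0.0 if i == j else abs(J[i][j]) / math.sqrt(J[i][i] * J[j][j]) for j in range(p)]
        for i in range(p)
    ]


def spectral_norm_nonneg_symmetric(M, iters=500):
    """Estimate the spectral norm of a symmetric, entry-wise non-negative matrix.

    Power iteration is run on (M + I) so that the dominant eigenvalue is unique
    in magnitude; the Rayleigh quotient of M at the iterate is returned.
    """
    p = len(M)
    x = [1.0] * p
    lam = 0.0
    for _ in range(iters):
        y = [x[i] + sum(M[i][k] * x[k] for k in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
        Mx = [sum(M[i][k] * x[k] for k in range(p)) for i in range(p)]
        lam = sum(a * b for a, b in zip(x, Mx))
    return lam


def is_walk_summable(J, alpha):
    """Check the alpha-walk-summability condition for the potential matrix J."""
    return spectral_norm_nonneg_symmetric(partial_correlation_abs(J)) <= alpha
```

For a three-node chain with unit diagonal and edge potentials of magnitude 0.4, the estimated norm is 0.4 times the square root of 2, so the model is walk-summable for any alpha above that value.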

2.2 Tractable Graph Families 

We consider the class of Gaussian graphical models Markov on a graph $G_p$ belonging to
some ensemble $\mathcal{G}(p)$ of graphs with $p$ nodes. We consider the high-dimensional learning
regime, where both $p$ and the number of samples $n$ grow simultaneously; typically, the
growth of $p$ is much faster than that of $n$. We emphasize that in our formulation the
graph ensemble $\mathcal{G}(p)$ can either be deterministic or random; in the latter, we also specify a


3. A set $S \subset V$ is a separator for sets $A$ and $B$ if the removal of nodes in $S$ partitions $A$ and $B$ into distinct
components.
probability measure over the set of graphs in $\mathcal{G}(p)$. In the setting where $\mathcal{G}(p)$ is a
random-graph ensemble, let $P_{X,G}$ denote the joint probability distribution of the variables $X$ and
the graph $G \sim \mathcal{G}(p)$, and let $f_{X|G}$ denote the conditional (Gaussian) density of the variables
Markov on the given graph $G$. Let $P_G$ denote the probability distribution of graph $G$
drawn from a random ensemble $\mathcal{G}(p)$. We say that almost every (a.e.) graph $G$ satisfies
a certain property $Q$ if

$$\lim_{p \to \infty} P_G[G \text{ satisfies } Q] = 1.$$

In other words, the property $Q$ holds asymptotically almost surely (a.a.s.; see footnote 4) with respect to
the random-graph ensemble $\mathcal{G}(p)$. Our conditions and theoretical guarantees will be based
on this notion for random graph ensembles. Intuitively, this means that graphs that have a
vanishing probability of occurrence as $p \to \infty$ are ignored.

We now characterize the ensemble of graphs amenable for consistent structure estimation
under our formulation. To this end, we define the concept of local separation in graphs. See
Fig. 1 for an illustration. For $\gamma \in \mathbb{N}$, let $B_\gamma(i; G)$ denote the set of vertices within distance $\gamma$
from $i$ with respect to graph $G$. Let $H_{\gamma,i} := G(B_\gamma(i))$ denote the subgraph of $G$ spanned by
$B_\gamma(i; G)$, but in addition, we retain the nodes not in $B_\gamma(i)$ (and remove the corresponding
edges). Thus, the number of vertices in $H_{\gamma,i}$ is $p$.

Definition 1 ($\gamma$-Local Separator) Given a graph $G$, a $\gamma$-local separator $S_\gamma(i,j)$ between
$i$ and $j$, for $(i,j) \notin G$, is a minimal vertex separator (see footnote 5) with respect to the subgraph $H_{\gamma,i}$. In
addition, the parameter $\gamma$ is referred to as the path threshold for local separation.

In other words, the $\gamma$-local separator $S_\gamma(i,j)$ separates nodes $i$ and $j$ with respect to
paths in $G$ of length at most $\gamma$. We now characterize the ensemble of graphs based on the
size of local separators.

Definition 2 ($(\eta,\gamma)$-Local Separation Property) An ensemble of graphs satisfies the
$(\eta,\gamma)$-local separation property if for a.e. $G_p$ in the ensemble,

$$\max_{(i,j) \notin G_p} |S_\gamma(i,j)| \le \eta. \qquad (9)$$

We denote such a graph ensemble by $\mathcal{G}(p; \eta, \gamma)$.
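
Definition 1 can be made concrete on small graphs by brute force. The following sketch is an illustration under assumptions rather than part of the paper: the graph is represented as a dictionary mapping each node to its set of neighbors, the function names are hypothetical, and the separator is found by exhaustive search over subsets of the ball around i, which is exponential in the worst case and only sensible for toy examples.

```python
from collections import deque
from itertools import combinations


def ball(adj, i, gamma):
    """Vertices within graph distance gamma of node i (breadth-first search)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == gamma:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def connected(adj, allowed, i, j):
    """True if j is reachable from i using only vertices in `allowed`."""
    seen = {i}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return True
        for v in adj[u]:
            if v in allowed and v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def local_separator(adj, i, j, gamma):
    """Smallest vertex set separating i and j in the subgraph induced on the gamma-ball of i.

    Returns None when i and j are adjacent inside the ball (no separator exists).
    """
    B = ball(adj, i, gamma)
    if j not in B:
        return set()  # j is isolated from i in the local subgraph
    candidates = sorted(B - {i, j})
    for size in range(len(candidates) + 1):
        for S in combinations(candidates, size):
            if not connected(adj, B - set(S), i, j):
                return set(S)
    return None
```

On a 6-cycle with nodes 0 to 5, the separator between nodes 0 and 2 grows from the empty set (gamma = 1) to a single node (gamma = 2) to two nodes (gamma = 3), which shows how the path threshold controls the separator size.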

In Section 3, we propose an efficient algorithm for graphical model selection when the
underlying graph belongs to a graph ensemble $\mathcal{G}(p; \eta, \gamma)$ with sparse local separators (i.e.,
small $\eta$, for $\eta$ defined in (9)). We will see that the computational complexity of our proposed
algorithm scales as $O(p^{\eta+2})$. We now provide examples of several graph families satisfying
(9).


4. Note that the term a.a.s. does not apply to deterministic graph ensembles $\mathcal{G}(p)$ where no randomness is
assumed, and in this setting, we assume that the property $Q$ holds for every graph in the ensemble.

5. A minimal separator is a separator of smallest cardinality. 
Figure 1: Illustration of the $l$-local separator set $S(i,j;G,l)$ for the graph shown above with
$l = 4$. Note that $\mathcal{N}(i) = \{a, b, c, d\}$ is the neighborhood of $i$ and the $l$-local
separator set $S(i,j;G,l) = \{a,b\} \subset \mathcal{N}(i;G)$. This is because the path along $c$
connecting $i$ and $j$ has a length greater than $l$ and hence node $c \notin S(i,j;G,l)$.



Example 1: Bounded-Degree 

We now show that the local-separation property holds for a rich class of graphs. Any
(deterministic or random) ensemble of degree-bounded graphs $\mathcal{G}_{\mathrm{Deg}}(p, \Delta)$ satisfies the
$(\eta,\gamma)$-local separation property with $\eta = \Delta$ and arbitrary $\gamma \in \mathbb{N}$. If we do not impose any further
constraints on $\mathcal{G}_{\mathrm{Deg}}$, the computational complexity of our proposed algorithm scales as
$O(p^{\Delta+2})$ (see also Bresler et al. (2008), where the computational complexity is comparable).
Thus, when $\Delta$ is large, our proposed algorithm and the one in Bresler et al. (2008) are
computationally intensive. Our goal in this paper is to relax the usual bounded-degree
assumption and to consider ensembles of graphs $\mathcal{G}(p)$ whose maximum degrees may grow
with the number of nodes $p$. To this end, we discuss other structural constraints which can
lead to graphs with sparse local separators.

Example 2: Bounded Local Paths 

Another sufficient condition (see footnote 6) for the $(\eta,\gamma)$-local separation property in Definition 2 to hold
is that there are at most $\eta$ paths of length at most $\gamma$ in $G$ between any two nodes (henceforth
termed the $(\eta,\gamma)$-local paths property). In other words, there are at most $\eta - 1$
overlapping (see footnote 7) cycles of length smaller than $2\gamma$.
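
The local-paths property can be checked directly on a small graph by counting short simple paths. The sketch below is a hedged illustration: the function names are hypothetical, the graph is assumed to be a dictionary mapping each node to its set of neighbors, and the depth-first enumeration is exponential in gamma, so it is meant for toy graphs only.

```python
def count_local_paths(adj, i, j, gamma):
    """Count simple paths from i to j that use at most gamma edges."""
    def extend(u, visited, remaining):
        if u == j:
            return 1
        if remaining == 0:
            return 0
        return sum(
            extend(v, visited | {v}, remaining - 1)
            for v in adj[u]
            if v not in visited
        )

    return extend(i, {i}, gamma)


def has_local_paths_property(adj, eta, gamma):
    """True if every pair of distinct nodes has at most eta paths of length at most gamma."""
    nodes = sorted(adj)
    return all(
        count_local_paths(adj, i, j, gamma) <= eta
        for a, i in enumerate(nodes)
        for j in nodes[a + 1:]
    )
```

On a 6-cycle, two nodes at distance two are joined by one path of length at most 2 but by two paths of length at most 4, so the property with eta = 1 holds for gamma = 2 and fails for gamma = 3, where antipodal nodes already see both halves of the cycle.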

In particular, a special case of the local-paths property described above is the so-called
girth property. The girth of a graph is the length of the shortest cycle. Thus, a graph
with girth $g$ satisfies the $(\eta,\gamma)$-local separation property with $\eta = 1$ and $\gamma = g$. Let $\mathcal{G}_{\mathrm{Girth}}(p; g)$
denote the ensemble of graphs with girth at least $g$. There are many graph constructions



6. For any graph satisfying the $(\eta,\gamma)$-local separation property, the number of vertex-disjoint paths of length
at most $\gamma$ between any two non-neighbors is bounded above by $\eta$, by appealing to Menger's theorem
for bounded path lengths (Lovász et al., 1978). However, in the definition of the local-paths property, we
consider all distinct paths of length at most $\gamma$ and not just vertex-disjoint paths.

7. Two cycles are said to overlap if they have common vertices. 



High-Dimensional Gaussian Graphical Model Selection 



which lead to large girth. For example, the bipartite Ramanujan graph (Chung, 1997, p.
107) and the random Cayley graphs (Gamburd et al., 2009) have large girths.

The girth condition can be weakened to allow for a small number of short cycles, while
not allowing for typical node neighborhoods to contain short cycles. Such graphs are termed
locally tree-like. For instance, the ensemble of Erdos-Renyi graphs $\mathcal{G}_{\mathrm{ER}}(p, c/p)$, where
an edge between any node pair appears with a probability $c/p$, independent of other node
pairs, is locally tree-like. The parameter $c$ may grow with $p$, albeit at a controlled rate for
tractable structure learning. We make this more precise in Example 3 in Section 3.1. The
proof of the following result may be found in (Anandkumar et al., 2011a, Lemma 3).



Proposition 3 (Random Graphs are Locally Tree-Like) The ensemble of Erdos-Renyi
graphs $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies the $(\eta,\gamma)$-local separation property in (9) with

$$\eta = 2, \qquad \gamma \le \frac{\log p}{4 \log c}. \qquad (10)$$



Thus, there are at most two paths of length smaller than $\gamma$ between any two nodes in
Erdos-Renyi graphs a.a.s., or equivalently, there are no overlapping cycles of length smaller
than $2\gamma$ a.a.s. Similar observations apply for the more general scale-free or power-law
graphs (Chung and Lu, 2006; Dommers et al., 2010). Along similar lines, the ensemble
of $\Delta$-random regular graphs, denoted by $\mathcal{G}_{\mathrm{Reg}}(p, \Delta)$, which is the uniform ensemble of
regular graphs with degree $\Delta$, has no overlapping cycles of length at most $O(\log_{\Delta-1} p)$
a.a.s. (McKay et al., 2004, Lemma 1).



Example 3: Small-World Graphs

The previous two examples showed local separation holds under two different conditions: 
bounded maximum degree and bounded number of local paths. The former class of graphs 
can have short cycles but the maximum degree needs to be constant, while the latter class 
of graphs can have a large maximum degree but the number of overlapping short cycles 
needs to be small. We now provide instances which incorporate both these features, large
degrees and short cycles, and yet satisfy the local separation property.

The class of hybrid graphs or augmented graphs (Chung and Lu, 2006, Ch. 12) consists
of graphs which are the union of two graphs: a "local" graph having short cycles and a
"global" graph having small average distances. Since the hybrid graph is the union of these
local and global graphs, it has both large degrees and short cycles. The simplest model
$\mathcal{G}_{\mathrm{Watts}}(p, d, c/p)$, first studied by Watts and Strogatz (1998), consists of the union of a
$d$-dimensional grid and an Erdos-Renyi random graph with parameter $c$. It is easily seen that
a.e. graph $G \sim \mathcal{G}_{\mathrm{Watts}}(p, d, c/p)$ satisfies the $(\eta,\gamma)$-local separation property in (9), with

$$\eta = d + 2, \qquad \gamma \le \frac{\log p}{4 \log c}.$$

Similar observations apply for more general hybrid graphs studied in (Chung and Lu, 2006,
Ch. 12).
Counter-example: Dense Graphs

While the above examples illustrate that a large class of graphs satisfy the local separation
criterion, there indeed exist graphs which do not satisfy it. Such graphs tend to be "dense",
i.e., the number of edges scales super-linearly in the number of nodes. An instance is the
ensemble of Erdos-Renyi graphs $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ in the dense regime, where the average degree scales as
$c = \Omega(p^2)$. In this regime, the node degrees as well as the number of short cycles grow
with $p$ and thus, the size of the local separators also grows with $p$. Such graphs are hard
instances for our algorithm.

3. Guarantees for Conditional Covariance Thresholding 
3.1 Assumptions 

(A1) Sample Scaling Requirements: We consider the asymptotic setting where both
the number of variables (nodes) $p$ and the number of samples $n$ tend to infinity. We
assume that the parameters $(n, p, J_{\min})$ scale in the following fashion (see footnote 8):

$$n = \Omega(J_{\min}^{-2} \log p). \qquad (11)$$

We require that the number of nodes $p \to \infty$ to exploit the local separation properties
of the class of graphs under consideration.

(A2) $\alpha$-Walk-summability: The Gaussian graphical model Markov on $G_p \sim \mathcal{G}(p)$ is
$\alpha$-walk summable a.a.s., i.e.,

$$\|\overline{R}_{G_p}\| \le \alpha < 1, \quad \text{a.e. } G_p \sim \mathcal{G}(p), \qquad (12)$$

where $\alpha$ is a constant (i.e., not a function of $p$), $\overline{R} := [|R(i,j)|]$ is the entry-wise
absolute value of the partial correlation matrix $R$ and $\|\cdot\|$ denotes the spectral norm.

(A3) Local-Separation Property: We assume that the ensemble of graphs $\mathcal{G}(p; \eta, \gamma)$
satisfies the $(\eta,\gamma)$-local separation property with $\eta$, $\gamma$ satisfying:

$$\eta = O(1), \qquad J_{\min} D_{\min}^{-1} \alpha^{-\gamma} = \omega(1), \qquad (13)$$

where $\alpha$ is given by (12) and $D_{\min} := \min_i J(i,i)$ is the minimum diagonal entry of
the potential matrix $J$.

(A4) Condition on Edge-Potentials: The minimum absolute edge potential of an
$\alpha$-walk summable Gaussian graphical model satisfies

$$D_{\min}(1 - \alpha) \min_{(i,j) \in G_p} \frac{J(i,j)}{K(i,j)} > 1 + \delta, \qquad (14)$$

for almost every $G_p \sim \mathcal{G}(p)$, for some $\delta > 0$ (not depending on $p$) and (see footnote 9)

$$K(i,j) := \|J(V \setminus \{i,j\}, \{i,j\})\|^2$$



8. The notations $\omega(\cdot)$, $\Omega(\cdot)$ refer to asymptotics as the number of variables $p \to \infty$.

9. Here and in the sequel, for $A, B \subset V$, we use the notation $J(A, B)$ to denote the sub-matrix of $J$ indexed
by rows in $A$ and columns in $B$.

is the squared spectral norm of the submatrix of the potential matrix $J$, and $D_{\min} := \min_i J(i,i)$
is the minimum diagonal entry of $J$. Intuitively, (14) limits the extent of
non-homogeneity in the model and the extent of overlap of neighborhoods. Moreover,
this assumption is not required for consistent graphical model selection when the
model is attractive ($J_{i,j} \le 0$ for $i \ne j$) (see footnote 10).

(A5) Choice of threshold $\xi_{n,p}$: The threshold $\xi_{n,p}$ for graph estimation under the CCT
algorithm is chosen as a function of the number of nodes $p$, the number of samples $n$,
and the minimum edge potential $J_{\min}$ as follows:

$$\xi_{n,p} = O(J_{\min}), \qquad \xi_{n,p} = \omega\left(\frac{\alpha^{\gamma}}{D_{\min}}\right), \qquad \xi_{n,p} = \Omega\left(\sqrt{\frac{\log p}{n}}\right), \qquad (15)$$

where $\alpha$ is given by (12), $D_{\min} := \min_i J(i,i)$ is the minimum diagonal entry of the
potential matrix $J$, and $\gamma$ is the path threshold in (9) for the $(\eta,\gamma)$-local separation
property to hold.

Assumption (A1) stipulates how $n$, $p$ and $J_{\min}$ should scale for consistent graphical model
selection, i.e., the sample complexity. The sample size $n$ needs to be sufficiently large with
respect to the number of variables $p$ in the model for consistent structure reconstruction.
Assumptions (A2) and (A4) impose constraints on the model parameters. Assumption
(A3) restricts the class of graphs under consideration. To the best of our knowledge, all
previous works dealing with graphical model selection, e.g., Meinshausen and Bühlmann
(2006); Ravikumar et al. (2008), also impose some conditions for consistent graphical model
selection. Assumption (A5) is with regard to the choice of a suitable threshold $\xi_{n,p}$ for
thresholding conditional covariances. In the sequel, we compare the conditions for consistent
recovery after presenting our main theorem.

Example 1: Degree-Bounded Ensembles 

To gain a better understanding of conditions (A1)-(A5), consider the ensemble of graphs
$\mathcal{G}_{\mathrm{Deg}}(p; \Delta)$ with bounded degree $\Delta \in \mathbb{N}$. It can be established that for the walk-summability
condition in (A2) to hold (see footnote 11), we require that for normalized precision matrices ($J(i,i) = 1$),

$$J_{\max} = O\left(\frac{1}{\Delta}\right). \qquad (16)$$

See Section A.2 for detailed discussion. When the minimum potential achieves the bound
($J_{\min} = \Theta(1/\Delta)$), a sufficient condition for (A3) to hold is given by

$$\Delta \alpha^{\gamma} = o(1), \qquad (17)$$

where $\gamma$ is the path threshold for the local-separation property to hold according to
Definition 2. Intuitively, we require a larger path threshold $\gamma$ as the degree bound $\Delta$ on the
graph ensemble increases.



10. The assumption (A5) rules out the possibility that the neighbors are marginally independent. See
Section B.3 for details.

11. We can provide improved bounds for random-graph ensembles. See Section A.2 for details.
Note that (17) allows for the degree bound $\Delta$ to grow with the number of nodes as
long as the path threshold $\gamma$ also grows appropriately. For example, if the maximum degree
scales as $\Delta = O(\mathrm{poly}(\log p))$ and the path threshold scales as $\gamma = O(\log\log p)$, then (17) is
satisfied. This implies that graphs with fairly large degrees and short cycles can be recovered
successfully using our algorithm.
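
To see how the two growth rates interact, the following worked calculation may help. It assumes, for illustration only, the specific forms $\Delta = (\log p)^k$ and $\gamma = C \log\log p$ with constants $k, C > 0$ (these constants are not taken from the text), and it uses natural logarithms.

```latex
\begin{align*}
\alpha^{\gamma} &= \exp\bigl(C \log\log p \cdot \log \alpha\bigr) = (\log p)^{-C \log(1/\alpha)}, \\
\Delta\,\alpha^{\gamma} &= (\log p)^{\,k - C \log(1/\alpha)} \;\longrightarrow\; 0
\quad\text{if and only if}\quad C > \frac{k}{\log(1/\alpha)}.
\end{align*}
```

Under these assumptions, with $\alpha = 1/2$ and $k = 2$ the requirement reads $C > 2/\log 2 \approx 2.89$, so the constant in the path threshold has to be large enough relative to the degree exponent for the product to vanish.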

Example 2: Girth-Bounded Ensembles 

The condition in (17) can be specialized for the ensemble of girth-bounded graphs $\mathcal{G}_{\mathrm{Girth}}(p; g)$
in a straightforward manner as

$$\Delta \alpha^{g} = o(1), \qquad (18)$$

where $g$ corresponds to the girth of the graphs in the ensemble. The condition in (18)
demonstrates a natural tradeoff between the girth and the maximum degree; graphs with
large degrees can be learned efficiently if their girths are large. Indeed, in the extreme
case of trees, which have infinite girth, in accordance with (18), there is no constraint on
node degrees for successful recovery; recall that the Chow-Liu algorithm (Chow and Liu,
1968) is an efficient method for model selection on tree distributions.



Example 3: Erdős-Rényi and Small-World Ensembles

We can also conclude that a.e. Erdős-Rényi graph $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies (13) when
$c = O(\mathrm{poly}(\log p))$ under the best-possible scaling of $J_{\min}$ subject to the walk-summability
constraint in (12).

This is because it can be shown that $J_{\min} = O(1/\sqrt{\Delta})$ for walk-summability in (12)
to hold. See Section A.2 for details. Note that a.a.s., the maximum degree $\Delta$ for
$G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies

$$\Delta = O\left(\frac{\log p \, \log c}{\log\log p}\right),$$

from (Bollobás, 1985, Ex. 3.6), and $\gamma = O\left(\frac{\log p}{\log c}\right)$ from (10). Thus, the Erdős-Rényi graphs
are amenable to successful recovery when the average degree $c = O(\mathrm{poly}(\log p))$. Similarly,
for the small-world ensemble $\mathcal{G}_{\mathrm{Watts}}(p, d, c/p)$, when $d = O(1)$ and $c = O(\mathrm{poly}(\log p))$, the
graphs are amenable to consistent estimation.

3.2 Consistency of Conditional Covariance Thresholding 

Assuming (A1)–(A5), we now state our main result. The proof of this result and the
auxiliary lemmata for the proof can be found in Sections B and C.

Theorem 4 (Structural consistency of CCT) For structure learning of Gaussian graphical
models Markov on a graph $G_p \sim \mathcal{G}(p; \eta, \gamma)$, $\mathrm{CCT}(x^n; \xi_{n,p}, \eta)$ is consistent for a.e. graph
$G_p$. In other words,

$$\lim_{\substack{n,p \to \infty \\ n = \Omega(J_{\min}^{-2} \log p)}} P\left[\mathrm{CCT}(\{x^n\}; \xi_{n,p}, \eta) \neq G_p\right] = 0. \qquad (19)$$



Remarks:

1. Consistency guarantee: The CCT algorithm consistently recovers the structure
of Gaussian graphical models asymptotically, with probability tending to one, where
the probability measure is with respect to both the random graph (drawn from the
ensemble $\mathcal{G}(p; \eta, \gamma)$) and the samples (drawn from $\prod_{i=1}^{n} f(x_i \mid G)$).

2. Analysis of sample complexity: The above result states that the sample complexity
for the CCT is $n = \Omega(J_{\min}^{-2} \log p)$, which improves when the minimum edge potential
$J_{\min}$ is large.$^{12}$ This is intuitive since the edges have stronger potentials in this case.
On the other hand, $J_{\min}$ cannot be arbitrarily large since the $\alpha$-walk-summability
assumption in (12) imposes an upper bound on $J_{\min}$. The minimum sample complexity
(over different parameter settings) is attained when $J_{\min}$ achieves this upper bound.
See Section A.2 for details. For example, for any degree-bounded graph ensemble
$\mathcal{G}(p, \Delta)$ with maximum degree $\Delta$, the minimum sample complexity is $n = \Omega(\Delta^2 \log p)$,
i.e., when $J_{\min} = \Theta(1/\Delta)$, while for Erdős-Rényi random graphs, the minimum sample
complexity can be improved to $n = \Omega(\Delta \log p)$, i.e., when $J_{\min} = \Theta(1/\sqrt{\Delta})$.



3. Comparison with Ravikumar et al. (2008): The work by Ravikumar et al.
(2008) employs an $\ell_1$-penalized likelihood estimator for structure estimation in Gaussian
graphical models. Under the so-called incoherence conditions, the sample complexity
is $n = \Omega((\Delta^2 + J_{\min}^{-2}) \log p)$. Our sample complexity in (11) is the same in
terms of its dependence on $J_{\min}$, and there is no explicit dependence on the maximum
degree $\Delta$. Moreover, we have a transparent sufficient condition in terms of
$\alpha$-walk-summability in (12), which directly imposes scaling conditions on $J_{\min}$.

4. Comparison with Meinshausen and Bühlmann (2006): The work by Meinshausen and Bühlmann
(2006) considers $\ell_1$-penalized linear regression for neighborhood selection of Gaussian
graphical models and establishes a sample complexity of $n = \Omega((\Delta + J_{\min}^{-2}) \log p)$. We
note that our guarantees allow for graphs which do not necessarily satisfy the conditions
imposed by Meinshausen and Bühlmann (2006). For instance, the assumption
of neighborhood stability (assumption 6 in (Meinshausen and Bühlmann, 2006)) is
hard to verify in general, and the relaxation of this assumption corresponds to the
class of models with diagonally-dominant covariance matrices. Note that the class
of Gaussian graphical models with diagonally-dominant covariance matrices forms a
strict sub-class of walk-summable models, and thus satisfies assumption (A2) for the
theorem to hold. Thus, Theorem 4 applies to a larger class of Gaussian graphical models
compared to Meinshausen and Bühlmann (2006). Furthermore, the conditions for
successful recovery in Theorem 4 are arguably more transparent.

5. Comparison with Ising models: Our above result for learning Gaussian graphical
models is analogous to structure estimation of Ising models subject to an upper
bound on the edge potentials (Anandkumar et al., 2011b), and we characterize such a
regime as a conditional uniqueness regime. Thus, walk-summability is the analogous
condition for Gaussian models.

12. Note that the sample complexity also implicitly depends on walk-summability parameter $\alpha$ through (13).

Proof Outline We first analyze the scenario when exact statistics are available. (i) We
establish that for any two non-neighbors $(i,j) \notin G$, the minimum conditional covariance
in (1) (based on exact statistics) does not exceed the threshold $\xi_{n,p}$. (ii) Similarly, we also
establish that the conditional covariance in (1) exceeds the threshold $\xi_{n,p}$ for all neighbors
$(i,j) \in G$. (iii) We then extend these results to empirical versions using concentration
bounds.
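Steps (i) and (ii) of the outline can be sketched on exact statistics as follows. This is a minimal illustration, assuming the test declares an edge when the minimum of $|\Sigma(i,j \mid S)|$ over sets $S$ with $|S| \le \eta$ exceeds the threshold; the helper names `invert`, `cond_cov` and `cct`, the chain-graph example and the threshold value are hypothetical choices, and the empirical version in step (iii) would replace `Sigma` with a sample covariance.

```python
from itertools import combinations

def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    A = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        d = A[col][col]
        A[col] = [x / d for x in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]

def cond_cov(Sigma, i, j, S):
    """Sigma(i,j|S) via the Schur complement; S is a tuple of node indices."""
    if not S:
        return Sigma[i][j]
    SS_inv = invert([[Sigma[a][b] for b in S] for a in S])
    corr = sum(Sigma[i][a] * SS_inv[x][y] * Sigma[b][j]
               for x, a in enumerate(S) for y, b in enumerate(S))
    return Sigma[i][j] - corr

def cct(Sigma, xi, eta):
    """Keep (i,j) iff the minimum over |S| <= eta of |Sigma(i,j|S)| exceeds xi."""
    p = len(Sigma)
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        m = min(abs(cond_cov(Sigma, i, j, S))
                for r in range(eta + 1) for S in combinations(others, r))
        if m > xi:
            edges.add((i, j))
    return edges

if __name__ == "__main__":
    a = 0.3  # assumed edge potential on a 4-node chain 0-1-2-3
    J = [[1, -a, 0, 0], [-a, 1, -a, 0], [0, -a, 1, -a], [0, 0, -a, 1]]
    print(sorted(cct(invert(J), xi=0.1, eta=1)))
```

On this chain every non-neighboring pair has a single-node separator, so conditioning on it drives the conditional covariance to zero (up to rounding) while the conditional covariances on true edges stay well above the assumed threshold.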

3.2.1 Performance of Conditional Mutual Information Test 



We now employ the conditional mutual information test, analyzed in Anandkumar et al.
(2011b) for Ising models, and note that it has slightly worse sample complexity than using
conditional covariances. Using the threshold $\xi_{n,p}$ defined in (15), the conditional mutual
information test CMIT is given by the threshold test

$$\min_{|S| \le \eta} I(X_i; X_j \mid X_S) > \xi_{n,p}^2, \qquad (20)$$

and node pairs $(i,j)$ exceeding the threshold are added to the estimate $\widehat{G}^n$. Assuming (A1)–(A5),
we have the following result.

Theorem 5 (Structural consistency of CMIT) For structure learning of the Gaussian
graphical model on a graph $G_p \sim \mathcal{G}(p; \eta, \gamma)$, $\mathrm{CMIT}(x^n; \xi_{n,p}, \eta)$ is consistent for a.e. graph
$G_p$. In other words,

$$\lim_{\substack{n,p \to \infty \\ n = \Omega(J_{\min}^{-4} \log p)}} P\left[\mathrm{CMIT}(\{x^n\}; \xi_{n,p}, \eta) \neq G_p\right] = 0. \qquad (21)$$

The proof of this theorem is provided in Section C.3.

Remarks: 

1. For Gaussian random variables, conditional covariances and conditional mutual
information are equivalent tests for conditional independence. However, from the above
results, we note that there is a difference in the sample complexity for the two tests.
The sample complexity of CMIT is $n = \Omega(J_{\min}^{-4} \log p)$ in contrast to $n = \Omega(J_{\min}^{-2} \log p)$
for CCT. This is due to faster decay of conditional mutual information on the edges
compared to the decay of conditional covariances. Thus, conditional covariances are
more efficient for Gaussian graphical model selection compared to conditional mutual
information.
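The faster decay can be seen from the standard closed form for jointly Gaussian variables, $I(X_i; X_j \mid X_S) = -\tfrac{1}{2}\log(1 - \rho^2)$, where $\rho$ is the conditional correlation coefficient. The sketch below (the function name `gaussian_cmi` is an assumption made for illustration) compares this quantity with $\rho^2/2$: for weak edges the mutual information scales quadratically in the correlation, which is consistent with the gap between the two sample complexities.

```python
import math

def gaussian_cmi(rho):
    """Conditional mutual information (in nats) of two jointly Gaussian
    variables whose conditional correlation coefficient is rho."""
    return -0.5 * math.log(1.0 - rho * rho)

if __name__ == "__main__":
    for rho in (0.4, 0.2, 0.1, 0.05):
        # second column: exact value; third column: quadratic approximation
        print(rho, gaussian_cmi(rho), rho ** 2 / 2)
```

Halving the correlation roughly quarters the mutual information, whereas the conditional covariance itself only halves.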

4. Necessary Conditions for Model Selection 

In the previous sections, we proposed and analyzed efficient algorithms for learning the
structure of Gaussian graphical models Markov on graph ensembles satisfying the local-separation
property. In this section, we derive necessary conditions for consistent structure learning.

For the class of degree-bounded graphs $\mathcal{G}_{\mathrm{Deg}}(p, \Delta)$, necessary conditions on sample
complexity have been characterized before (Wang et al., 2010) by considering a certain (limited)
set of ensembles. However, a naive application of such bounds (based on Fano's inequality
(Cover and Thomas, 2006, Ch. 2)) turns out to be too weak for the class of Erdős-Rényi
graphs $\mathcal{G}_{\mathrm{ER}}(p, c/p)$, where the average degree$^{13}$ $c$ is much smaller than the maximum degree.
We now provide necessary conditions on the sample complexity for recovery of Erdős-Rényi
graphs. Our information-theoretic techniques may also be applicable to other ensembles
of random graphs. This is a promising avenue for future work.

[Diagram: $X^m$ → Encoder → Decoder → $\widehat{X}^m$]

Figure 2: The canonical source coding problem. See Chapter 3 in (Cover and Thomas,
2006).



4.1 Setup 

We now describe the problem more formally. A graph $G$ is drawn from the ensemble of
Erdős-Rényi graphs $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$. The learner is also provided with $n$ conditionally
i.i.d. samples $X^n := (X_1, \ldots, X_n) \in (\mathcal{X}^p)^n$ (where $\mathcal{X} = \mathbb{R}$) drawn from the conditional
(Gaussian) product probability density function (pdf) $\prod_{i=1}^{n} f(x_i \mid G)$. The task is then to
estimate $G$, a random quantity. The estimate is denoted as $\widehat{G} := \widehat{G}(X^n)$. It is desired to
derive tight necessary conditions on $n$ (as a function of $c$ and $p$) so that the probability of
error

$$P_e^{(p)} := P(\widehat{G} \neq G) \to 0 \qquad (22)$$

as the number of nodes $p$ tends to infinity. Note that the probability measure $P$ in (22) is
associated to both the realization of the random graph $G$ and the samples $X^n$.

The task is reminiscent of source coding (or compression), a problem of central importance
in information theory (Cover and Thomas, 2006): we would like to derive fundamental
limits associated to the problem of reconstructing the source $G$ given a compressed
version of it, $X^n$ ($X^n$ is also analogous to the "message"). However, note the important
distinction: while in source coding, the source coder can design both the encoder and the
decoder, our problem mandates that the code is fixed by the conditional probability density
$f(x \mid G)$. We are only allowed to design the decoder. See comparisons in Figs. 2 and 3.



4.2 Necessary Conditions for Exact Recovery 

To derive the necessary condition for learning Gaussian graphical models Markov on sparse
Erdős-Rényi graphs $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$, we assume that the strict walk-summability condition
with parameter $\alpha$, according to (12), holds. We are then able to demonstrate the following:

Theorem 6 (Weak Converse for Gaussian Models) For a walk-summable Gaussian
graphical model satisfying (12) with parameter $\alpha$, for almost every graph $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$
as $p \to \infty$, in order for $P_e^{(p)} \to 0$, we require that

$$n \ge \frac{2\binom{p}{2} H_b\left(\frac{c}{p}\right)}{p \log_2\left[2\pi e\left(\frac{1}{1-\alpha} + 1\right)\right]} \qquad (23)$$

for all $p$ sufficiently large.

13. The techniques in this section are applicable when the average degree ($c$) of the $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ ensemble is a
function of $p$, e.g., $c = O(\mathrm{poly}(\log p))$.

[Diagram: $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ → $\prod_{i=1}^{n} f(x_i \mid G)$ → $X^n \in (\mathbb{R}^p)^n$ → Decoder → $\widehat{G}$]

Figure 3: The estimation problem is analogous to source coding: the "source" is
$G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$, the "message" is $X^n \in (\mathbb{R}^p)^n$ and the "decoded source" is $\widehat{G}$. We are
asking what minimum "rate" (analogous to the number of samples $n$) is
required so that $\widehat{G} = G$ with high probability.



The proof is provided in Section D.1. By expanding the binary entropy function, it is easy
to see that the statement in (23) can be weakened to the necessary condition:

$$n \ge \frac{c \log_2 p}{\log_2\left[2\pi e\left(\frac{1}{1-\alpha} + 1\right)\right]}. \qquad (24)$$

The above condition does not involve any asymptotic notation, and also demonstrates the
dependence of the sample complexity on $p$, $c$ and $\alpha$ transparently. Finally, the dependence
on $\alpha$ can be explained as follows: any $\alpha$-walk-summable model is also $\beta$-walk-summable
for all $\beta > \alpha$. Thus, the class of $\beta$-walk-summable models contains the class of
$\alpha$-walk-summable models. This results in a looser bound in (23) for larger $\alpha$.
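To get a feel for the magnitudes involved, the following sketch evaluates the right-hand sides of (23) and (24), read as $2\binom{p}{2}H_b(c/p) / \big(p \log_2[2\pi e(\tfrac{1}{1-\alpha}+1)]\big)$ and $c\log_2 p / \log_2[2\pi e(\tfrac{1}{1-\alpha}+1)]$ respectively. The function names and the example values of $p$, $c$ and $\alpha$ are assumptions made purely for illustration.

```python
import math

def binary_entropy(q):
    """H_b(q) in bits, with the convention H_b(0) = H_b(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def log_term(alpha):
    """Denominator factor log2[2*pi*e*(1/(1-alpha) + 1)] shared by both bounds."""
    return math.log2(2.0 * math.pi * math.e * (1.0 / (1.0 - alpha) + 1.0))

def lower_bound_23(p, c, alpha):
    """Right-hand side of (23) as read above."""
    return 2.0 * math.comb(p, 2) * binary_entropy(c / p) / (p * log_term(alpha))

def lower_bound_24(p, c, alpha):
    """Right-hand side of (24) as read above."""
    return c * math.log2(p) / log_term(alpha)

if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.9):  # hypothetical walk-summability parameters
        print(alpha, lower_bound_23(1000, 2, alpha), lower_bound_24(1000, 2, alpha))
```

Both expressions decrease as $\alpha$ increases, matching the remark that the bound becomes looser for larger $\alpha$.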



4.3 Necessary Conditions for Recovery with Distortion 

In this section, we generalize Theorem 6 to the case where we only require estimation of
the underlying graph up to a certain edit distance: an error is declared if and only if the
estimated graph $\widehat{G}$ is more than an edit distance (or distortion) $D$ from the true graph. The edit
distance $d : \mathfrak{G}_p \times \mathfrak{G}_p \to \mathbb{N} \cup \{0\}$ between two undirected graphs $G = (V, E)$ and $G' = (V, E')$
is defined as $d(G, G') := |E \,\triangle\, E'|$, where $\triangle$ denotes the symmetric difference between the
edge sets $E$ and $E'$. The edit distance can be regarded as a distortion measure between two
graphs.
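The definition translates directly into a set operation. The short sketch below (the name `edit_distance` and the edge-list representation are illustrative assumptions) treats each undirected edge as an unordered pair, so that `(0, 1)` and `(1, 0)` denote the same edge.

```python
def edit_distance(E, E_prime):
    """|E symmetric-difference E'| for undirected edge lists on a common vertex set."""
    def normalize(edges):
        # an undirected edge is an unordered pair of endpoints
        return {frozenset(e) for e in edges}
    return len(normalize(E) ^ normalize(E_prime))

if __name__ == "__main__":
    G = [(0, 1), (1, 2)]
    G_hat = [(1, 0), (2, 3)]
    print(edit_distance(G, G_hat))
```

Here the two graphs share the edge between nodes 0 and 1 and differ in one edge each, so the distance counts both the missed edge and the spurious one.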

Given a positive integer $D$, known as the distortion, suppose we declare an error if and
only if $d(G, \widehat{G}) > D$; then the probability of error is redefined as

$$P_e^{(p)} := P\big(d(G, \widehat{G}(X^n)) > D\big). \qquad (25)$$

We derive necessary conditions on $n$ (as a function of $p$ and $c$) such that the probability of
error (25) goes to zero as $p \to \infty$. To ease notation, we define the ratio

$$\beta := D \Big/ \binom{p}{2}. \qquad (26)$$

Note that $\beta$ may be a function of $p$. We do not attempt to make this dependence explicit.
The following corollary is based on an idea propounded by Kim et al. (2008) among others.



Corollary 7 (Weak Converse for Discrete Models With Distortion) For $P_e^{(p)} \to 0$,
we must have

$$n \ge \frac{2\binom{p}{2}\left[H_b\left(\frac{c}{p}\right) - H_b(\beta)\right]}{p \log_2\left[2\pi e\left(\frac{1}{1-\alpha} + 1\right)\right]} \qquad (27)$$

for all $p$ sufficiently large.

The proof of this corollary is provided in Section D.7. Note that for (27) to be a useful
bound, we need $\beta < c/p$, which translates to an allowed distortion $D < cp/2$. We observe
from (27) that because the error criterion has been relaxed, the required number of samples
is also reduced from the corresponding lower bound in (23).



4.4 Proof Techniques 

Our analysis tools for deriving necessary conditions for Gaussian graphical model selection
are information-theoretic in nature. A common and natural tool to derive necessary conditions
(also called converses) is to resort to Fano's inequality (Cover and Thomas, 2006,
Chapter 2), which (lower) bounds the probability of error $P_e$ as a function of the equivocation
or conditional entropy $H(G \mid X^n)$ and the size of the set of all graphs with $p$ nodes.
However, a direct and naive application of Fano's inequality results in a trivial lower bound, as
the set of all graphs that can be realized by $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ is, loosely speaking, "too large".
To ameliorate such a problem, we employ another information-theoretic notion, known
as typicality. A typical set is, roughly speaking, a set that has small cardinality and yet has
high probability as $p \to \infty$. For example, the probability of a sequence of length $m$ is
of the order $\approx 2^{-mH}$ (where $H$ is the entropy rate of the source) and hence those sequences
with probability close to this value are called typical. In our context, given a graph $G$, we
define $\bar{d}(G)$ to be the ratio of the number of edges of $G$ to the total number of nodes $p$.
Let $\mathfrak{G}_p$ denote the set of all graphs with $p$ nodes. For a fixed $\epsilon > 0$, we define the following
set of graphs:



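$$\mathcal{T}_\epsilon^{(p)} := \left\{ G \in \mathfrak{G}_p : \left| \frac{\bar{d}(G)}{c} - \frac{1}{2} \right| \le \frac{\epsilon}{2} \right\}. \qquad (28)$$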



The set $\mathcal{T}_\epsilon^{(p)}$ is known as the $\epsilon$-typical set of graphs. Every graph $G \in \mathcal{T}_\epsilon^{(p)}$ has an average
number of edges that is $\frac{c}{2}\epsilon$-close to its expected value in the Erdős-Rényi ensemble. Note that typicality ideas are
usually used to derive sufficient conditions in information theory (Cover and Thomas, 2006)
(achievability in information-theoretic parlance); our use of both typicality for graphical
model selection as well as Fano's inequality to derive converse statements seems novel.
Indeed, the proof of the converse of the source coding theorem in Cover and Thomas (2006,
Chapter 3) utilizes only Fano's inequality. We now summarize the properties of the typical
set.

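Lemma 8 (Properties of $\mathcal{T}_\epsilon^{(p)}$) The $\epsilon$-typical set of graphs has the following properties:

1. $P(\mathcal{T}_\epsilon^{(p)}) \to 1$ as $p \to \infty$.

2. For all $G \in \mathcal{T}_\epsilon^{(p)}$, we have$^{14}$

$$\exp_2\left[-\binom{p}{2} H_b\left(\frac{c}{p}\right)(1 + \epsilon)\right] \le P(G) \le \exp_2\left[-\binom{p}{2} H_b\left(\frac{c}{p}\right)\right]. \qquad (29)$$

3. The cardinality of the $\epsilon$-typical set can be bounded as

$$(1 - \epsilon) \exp_2\left[\binom{p}{2} H_b\left(\frac{c}{p}\right)\right] \le \left|\mathcal{T}_\epsilon^{(p)}\right| \le \exp_2\left[\binom{p}{2} H_b\left(\frac{c}{p}\right)(1 + \epsilon)\right] \qquad (30)$$

for all $p$ sufficiently large.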



The proof of this lemma can be found in Section D.2. Parts 1 and 3 of Lemma 8 respectively
say that the set of typical graphs has high probability and has very small cardinality relative
to the number of graphs with $p$ nodes, $|\mathfrak{G}_p| = \exp_2\left(\binom{p}{2}\right)$. Part 2 of Lemma 8 is known as
the asymptotic equipartition property: the graphs in the typical set are almost uniformly
distributed.
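The equipartition behaviour can be checked numerically. Under the Erdős-Rényi model, a fixed graph with $k$ edges has probability $q^k (1-q)^{\binom{p}{2} - k}$ with $q = c/p$, so a graph with exactly the expected number of edges has $\log_2$-probability equal to $-\binom{p}{2} H_b(c/p)$. The sketch below illustrates this; the function names and the values $p = 100$, $c = 2$ are assumptions chosen for illustration.

```python
import math

def log2_prob_er(num_edges, p, c):
    """log2 of the probability that G_ER(p, c/p) equals one fixed graph
    on p nodes with the given number of edges."""
    q = c / p
    n_pairs = math.comb(p, 2)
    return num_edges * math.log2(q) + (n_pairs - num_edges) * math.log2(1.0 - q)

def entropy_exponent(p, c):
    """binom(p,2) * H_b(c/p), the exponent governing typical-graph probabilities."""
    q = c / p
    h = -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)
    return math.comb(p, 2) * h

if __name__ == "__main__":
    p, c = 100, 2
    expected_edges = math.comb(p, 2) * c // p   # 4950 * 0.02 = 99 edges
    print(log2_prob_er(expected_edges, p, c), -entropy_exponent(p, c))
```

Graphs whose edge count is close to the mean therefore all have nearly the same probability, which is what allows the typical set to replace the full set of graphs in the Fano argument.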



5. Implications on Loopy Belief Propagation 

An active area of research in the graphical model community is that of inference, i.e.,
the task of computing node marginals (or MAP estimates) through efficient distributed
algorithms. The simplest of these algorithms is the belief propagation$^{15}$ (BP) algorithm,
where messages are passed among the neighbors of the graph of the model. It is known
that belief propagation (and max-product) is exact on tree models, meaning that correct
marginals are computed at all the nodes (Pearl, 1988). On the other hand, on general
graphs, the generalized version of BP, known as loopy belief propagation (LBP), may not
converge and even if it does, the marginals may not be correct. Motivated by the twin
problems of convergence and correctness, there has been extensive work on characterizing
LBP's performance for different models. See Section 5.3 for details. As a by-product of our
previous analysis on graphical model selection, we now show the asymptotic correctness of
LBP on walk-summable Gaussian models when the underlying graph is locally tree-like.



5.1 Background 

The belief propagation (BP) algorithm is a distributed algorithm where messages (or beliefs) 
are passed among the neighbors to draw inferences at the nodes of a graphical model. The 
computation of node marginals through naive variable elimination (or Gaussian elimination 
in the Gaussian setting) is prohibitively expensive. However, if the graph is sparse (consists 
of few edges), the computation of node marginals may be sped up dramatically by exploiting 
the graph structure and using distributed algorithms to parallelize the computations. 

14. We use the notation $\exp_2(\cdot)$ to mean $2^{(\cdot)}$.

15. The variant of the belief propagation algorithm which computes the MAP estimates is known as the
max-product algorithm.

For the sake of completeness, we now recall the basic steps in LBP, specific to Gaussian
graphical models. Given a message schedule which specifies how messages are exchanged,
each node $j$ receives information from each of its neighbors (according to the graph), where
the message, $m^t_{i \to j}(x_j)$, from $i$ to $j$, in the $t$-th iteration is parameterized as

$$m^t_{i \to j}(x_j) := \exp\left[-\frac{1}{2} \Delta J^t_{i \to j} x_j^2 + \Delta h^t_{i \to j} x_j\right].$$

Each node $i$ prepares message $m^t_{i \to j}(x_j)$ by collecting messages from neighbors of the
previous iteration (under parallel iterations), and computing

$$\hat{J}_{i \setminus j}(t) = J(i,i) + \sum_{k \in \mathcal{N}(i) \setminus j} \Delta J^{t-1}_{k \to i}, \qquad \hat{h}_{i \setminus j}(t) = h(i) + \sum_{k \in \mathcal{N}(i) \setminus j} \Delta h^{t-1}_{k \to i},$$

where

$$\Delta J^t_{i \to j} = -J(j,i)\, \hat{J}^{-1}_{i \setminus j}(t)\, J(j,i), \qquad \Delta h^t_{i \to j} = -J(j,i)\, \hat{J}^{-1}_{i \setminus j}(t)\, \hat{h}_{i \setminus j}(t).$$
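The parallel updates above can be sketched in a few lines for scalar nodes. This is an illustrative implementation under stated assumptions: messages are initialized to zero, the schedule is fully parallel, the marginal at node $i$ is read off as $\big(J(i,i) + \sum_{k} \Delta J_{k \to i}\big)^{-1}$, and the function name `gaussian_bp` and the chain example are hypothetical.

```python
def gaussian_bp(J, h, iters=50):
    """Parallel Gaussian belief propagation for scalar nodes.

    J is the precision matrix (list of lists), h the potential vector.
    Returns the estimated means and variances at every node.
    """
    n = len(J)
    nbrs = [[k for k in range(n) if k != i and J[i][k] != 0.0] for i in range(n)]
    dJ = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}   # Delta J_{i->j}
    dh = dict(dJ)                                           # Delta h_{i->j}
    for _ in range(iters):
        new_dJ, new_dh = {}, {}
        for (i, j) in dJ:
            # collect last-iteration messages into i from all neighbors except j
            J_hat = J[i][i] + sum(dJ[(k, i)] for k in nbrs[i] if k != j)
            h_hat = h[i] + sum(dh[(k, i)] for k in nbrs[i] if k != j)
            new_dJ[(i, j)] = -J[j][i] * J[j][i] / J_hat
            new_dh[(i, j)] = -J[j][i] * h_hat / J_hat
        dJ, dh = new_dJ, new_dh
    means, variances = [], []
    for i in range(n):
        J_i = J[i][i] + sum(dJ[(k, i)] for k in nbrs[i])
        h_i = h[i] + sum(dh[(k, i)] for k in nbrs[i])
        variances.append(1.0 / J_i)
        means.append(h_i / J_i)
    return means, variances

if __name__ == "__main__":
    a = 0.3  # assumed partial correlation on a 3-node chain (a tree)
    J = [[1.0, -a, 0.0], [-a, 1.0, -a], [0.0, -a, 1.0]]
    print(gaussian_bp(J, [1.0, 0.0, 0.0], iters=10))
```

On a tree such as this chain the returned variances coincide with the diagonal of $J^{-1}$; on loopy graphs the same code runs but the variances are in general only approximate, which is the gap that Proposition 9 quantifies.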

5.2 Results 

Let $\Sigma_{\mathrm{LBP}}(i,i)$ denote the variance at node $i$ at the LBP fixed point.$^{16}$ Without loss of
generality, we consider the normalized version of the precision matrix

$$J = I - R,$$

which can always be obtained from a general precision matrix via normalization. We can
then renormalize the variances, computed via LBP, to obtain the variances corresponding
to the unnormalized precision matrix.

We consider the following ensemble of locally tree-like graphs. Consider the event that
the neighborhood of a node $i$ has no cycles up to graph distance $\gamma$, given by

$$\Gamma(i; \gamma, G) := \{B_\gamma(i; G) \text{ does not contain any cycles}\}.$$

We assume a random graph ensemble $\mathcal{G}(p)$ such that for a given node $i \in V$, the complementary
event $\Gamma^c$ (a cycle is present) is asymptotically unlikely:

$$P[\Gamma^c(i; \gamma, G)] = o(1). \qquad (31)$$

Proposition 9 (Correctness of LBP) Given an $\alpha$-walk-summable Gaussian graphical
model on a.e. locally tree-like graph $G \sim \mathcal{G}(p; \gamma)$ with parameter $\gamma$ satisfying (31), we
have

$$|\Sigma_G(i,i) - \Sigma_{\mathrm{LBP}}(i,i)| \overset{\text{a.a.s.}}{=} O\big(\max(\alpha^\gamma, P[\Gamma^c(i; \gamma, G)])\big). \qquad (32)$$

The proof is given in Section B.4.

Remarks:

1. The class of Erdős-Rényi random graphs $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies (31), with
$\gamma = O(\log p / \log c)$ for a node $i \in V$ chosen uniformly at random.

2. Recall that the class of random regular graphs $G \sim \mathcal{G}_{\mathrm{Reg}}(p, \Delta)$ has a girth of
$O(\log_{\Delta-1} p)$. Thus, for any node $i \in V$, (31) holds with $\gamma = O(\log_{\Delta-1} p)$.



16. Convergence of LBP on walk-summable models has been established by Malioutov et al. (2006).

5.3 Previous Work on Loopy Belief Propagation

It has long been known through numerous empirical studies (Murphy et al., 1999) and the
phenomenal successes of turbo decoding (McEliece et al., 2002) that loopy belief propagation
(LBP) performs reasonably well on a variety of graphical models, though it also
must be mentioned that LBP fails catastrophically on other models. Weiss (2000) proved
that if the underlying graph (of a Gaussian graphical model) consists of a single cycle,
LBP converges and is correct, i.e., the fixed points of the means and the variances
are the same as the true means and variances. In addition, sufficient conditions for a
unique fixed point are known (Mooij and Kappen, 2007). The max-product variant of
LBP (called the max-product or min-sum algorithm) has been studied (Bayati et al., 2005;
Sanghavi et al., 2009; Ruozzi and Tatikonda, 2010). Despite its seemingly heuristic nature,
LBP has found a variety of concrete applications, especially in combinatorial optimization
(Moallemi and Van Roy, 2010; Gamarnik et al., 2010). Indeed, it has been applied
and analyzed for NP-hard problems such as maximum matching (Bayati et al., 2008b),
b-matching (Sanghavi et al., 2009), the Steiner tree problem (Bayati et al., 2008a).

The application of BP for inference in Gaussian graphical models has been studied
extensively, starting with the seminal work by Weiss and Freeman (2001). Undoubtedly the
Kalman filter is the most familiar instance of BP in Gaussian graphical models. The notion
of walk-summability in Gaussian graphical models was introduced by Malioutov et al.
(2006). Among other results, the authors showed that LBP converges to the correct means
for walk-summable models but the estimated variances may nevertheless still be incorrect.
Chandrasekaran et al. (2008) leveraged the ideas of Malioutov et al. (2006) to analyze
related inference algorithms such as embedded trees and the block Gauss-Seidel method.
Recently, Liu et al. (2010) considered a modified version of LBP by identifying a special
set of nodes, called the feedback vertex set (FVS) (Vazirani, 2001), that breaks (or
approximately breaks) cycles in the loopy graph. This allows one to perform inference in
a tractable manner, trading off accuracy and computational complexity. For Gaussian graphical
models Markov on locally tree-like graphs, an approximate FVS can be identified. This
set, though not an FVS per se, allows one to break all the short cycles in the graph and
thus, it allows for proving tight error bounds on the inferred variances. The performance of
LBP on locally tree-like graphs has also been studied for other families of graphical models.
For Ising models Markov on locally tree-like graphs, the work by Dembo and Montanari
(2010) established an analogous result for attractive (also known as ferromagnetic) models.
Note that the class of walk-summable Gaussian graphical models is a superset of the class of attractive
Gaussian models. An interpretation of LBP in terms of graph covers is given by Vontobel
(2010) and its equivalence to walk-summability for Gaussian graphical models is established
by Ruozzi et al. (2009).



6. Conclusion 

In this paper, we adopted a novel and unified paradigm for graphical model selection.
We presented a simple local algorithm for structure estimation with low computational
and sample complexities under a set of mild and transparent conditions. This algorithm
succeeds on a wide range of graph ensembles such as the Erdős-Rényi ensemble, small-world
networks, etc. We also employed novel information-theoretic techniques for establishing
necessary conditions for graphical model selection.

Appendix A. Walk-summable Gaussian Graphical Models

A.1 Background on Walk-Summability

We now recap the properties of walk-summable Gaussian graphical models, as given by (12).
For details, see Malioutov et al. (2006). For simplicity, we first assume that the diagonal of
the potential matrix $J$ is normalized ($J(i,i) = 1$ for all $i \in V$). We remove this assumption
and consider general unnormalized precision matrices in Section B.2. Consider splitting the
matrix $J$ into the identity matrix and the partial correlation matrix $R$, defined in (7):

$$J = I - R. \qquad (33)$$

The covariance matrix $\Sigma$ of the graphical model in (33) can be decomposed as

$$\Sigma = J^{-1} = (I - R)^{-1} = \sum_{k=0}^{\infty} R^k, \qquad \|R\| < 1, \qquad (34)$$

using Neumann power series for the matrix inverse. Note that we require that $\|R\| < 1$ for
(34) to hold, which is implied by walk-summability in (12) (since $\|R\| \le \|\overline{R}\|$).

We now relate the matrix power $R^l$ to walks on graph $G$. A walk $w$ of length $l \ge 0$
on graph $G$ is a sequence of nodes $w := (w_0, w_1, \ldots, w_l)$ traversed on the graph $G$, i.e.,
$(w_k, w_{k+1}) \in G$. Let $|w|$ denote the length of the walk. Given matrix $R_G$ supported on
graph $G$, let the weight of the walk be

$$\phi(w) := \prod_{k=1}^{|w|} R(w_{k-1}, w_k).$$

The elements of the matrix power $R^l$ are given by

$$R^l(i,j) = \sum_{w:\, i \overset{l}{\to} j} \phi(w), \qquad (35)$$

where $i \overset{l}{\to} j$ denotes the set of walks from $i$ to $j$ of length $l$. For this reason, we henceforth
refer to $R$ as the walk matrix.

Let $i \to j$ denote all the walks between $i$ and $j$. Under the walk-summability condition
in (12), we have convergence of $\sum_{w:\, i \to j} \phi(w)$, irrespective of the order in which the walks
are collected, and this is equal to the covariance $\Sigma(i,j)$.
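Both facts, the walk interpretation of matrix powers in (35) and the Neumann series in (34), can be verified on a small example. The sketch below uses a three-node cycle with mixed-sign partial correlations; the helper names and the numerical values are illustrative assumptions (the absolute row sums are at most 0.55, so the model is walk-summable).

```python
def matmul(A, B):
    """Product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def walk_sum(R, i, j, length):
    """Sum of walk weights phi(w) over all walks of the given length from i to j."""
    if length == 0:
        return 1.0 if i == j else 0.0
    return sum(R[i][k] * walk_sum(R, k, j, length - 1)
               for k in range(len(R)) if R[i][k] != 0.0)

def neumann_covariance(R, terms):
    """Partial sum I + R + ... + R^terms of the power series for (I - R)^{-1}."""
    n = len(R)
    power = [[float(a == b) for b in range(n)] for a in range(n)]
    total = [row[:] for row in power]
    for _ in range(terms):
        power = matmul(power, R)
        total = [[t + q for t, q in zip(tr, pr)] for tr, pr in zip(total, power)]
    return total

if __name__ == "__main__":
    R = [[0.0, 0.2, -0.3], [0.2, 0.0, 0.25], [-0.3, 0.25, 0.0]]
    R3 = matmul(matmul(R, R), R)
    print(R3[0][2], walk_sum(R, 0, 2, 3))   # entry of R^3 vs. enumerated walks
    print(neumann_covariance(R, 200)[0][2]) # approximates Sigma(0, 2)
```

The explicit enumeration in `walk_sum` is exponential in the walk length and is only meant to make the correspondence with (35) concrete; the matrix power computes the same quantity efficiently.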

In Section IA.31 we relate walk-summability in (|12p to the notion of correlation decay, 
where the effect of faraway nodes on covariances can be controlled and the local-separation 
property of the graphs under consideration can be exploited. 

A.2 Sufficient Conditions for Walk-summability

We now provide sufficient conditions and suitable parameterization for walk-summability
in (12) to hold. The adjacency matrix $A_G$ of a graph $G$ with maximum degree $\Delta_G$ satisfies

$$\lambda_{\max}(A_G) \le \Delta_G,$$

since it is dominated by a $\Delta$-regular graph which has maximum eigenvalue of $\Delta_G$. From the
Perron-Frobenius theorem, for adjacency matrix $A_G$, we have $\lambda_{\max}(A_G) = \|A_G\|$, where
$\|A_G\|$ is the spectral radius of $A_G$. Thus, for $R_G$ supported on graph $G$, we have

$$\alpha := \|\overline{R}_G\| = O(J_{\max} \Delta),$$

where $J_{\max} := \max_{i,j} |R(i,j)|$. This implies that

$$J_{\max} = O\left(\frac{1}{\Delta}\right) \qquad (36)$$

to have $\alpha < 1$, which is the requirement for walk-summability.
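The degree bound on the top eigenvalue is easy to check numerically. The sketch below estimates $\lambda_{\max}(A_G)$ by power iteration for a star graph and then picks $J_{\max}$ below $1/\Delta$, which guarantees $\alpha \le J_{\max}\lambda_{\max}(A_G) < 1$. The function names, the star-graph example and the factor 0.9 are assumptions chosen for illustration.

```python
def max_degree(A):
    """Maximum number of non-zero entries in any row of the adjacency matrix."""
    return max(sum(1 for x in row if x != 0) for row in A)

def lambda_max(A, iters=500):
    """Largest eigenvalue of a symmetric non-negative matrix A, estimated by
    power iteration on A + I (the shift avoids oscillation on bipartite graphs)."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [v[i] + sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        Av = [sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]
        lam = sum(x * y for x, y in zip(v, Av))   # Rayleigh quotient of A
    return lam

if __name__ == "__main__":
    # star graph: node 0 joined to four leaves, so the maximum degree is 4
    n = 5
    A = [[1 if (i == 0) != (j == 0) else 0 for j in range(n)] for i in range(n)]
    delta = max_degree(A)
    lam = lambda_max(A)
    j_max = 0.9 / delta          # assumed scaling J_max < 1/Delta
    print(delta, lam, j_max * lam)
```

For the star the top eigenvalue equals $\sqrt{\Delta}$, well below the worst-case value $\Delta$, so the $O(1/\Delta)$ scaling in (36) is sufficient here but not necessary.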

When the graph $G$ is an Erdős-Rényi random graph, $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$, we can provide
better bounds. When $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$, we have (Krivelevich and Sudakov, 2003) that

$$\lambda_{\max}(A_G) = (1 + o(1)) \max\left(\sqrt{\Delta_G},\, c\right),$$

where $\Delta_G$ is the maximum degree and $A_G$ is the adjacency matrix. Thus, in this case,
when $c = O(1)$, we require that

$$J_{\max} = O\left(\frac{1}{\sqrt{\Delta}}\right) \qquad (37)$$

for walk-summability ($\alpha < 1$). Note that when $c = O(\mathrm{poly}(\log p))$, w.h.p.
$\Delta_{G_p} = \Theta(\log p / \log\log p)$ (Bollobás, 1985, Ex. 3.6).



A.3 Implications of Walk-Summability

Recall that $\Sigma_G$ denotes the covariance matrix for the Gaussian graphical model on graph $G$ and
that $J_G = \Sigma_G^{-1}$ with $J_G = I - R_G$ in (33). We now relate the walk-summability condition
in (12) to correlation decay in the model. In other words, under walk-summability, we can
show that the effect of faraway nodes on covariances decays with distance, as made precise
in Lemma 10.

Let $B_\gamma(i)$ denote the set of nodes within $\gamma$ hops from node $i$ in graph $G$. Denote

$$H_{\gamma;ij} := G(B_\gamma(i) \cap B_\gamma(j)) \qquad (38)$$

as the induced subgraph of $G$ over the intersection of $\gamma$-hop neighborhoods at $i$ and $j$ and
retaining the nodes in $V \setminus \{B_\gamma(i) \cap B_\gamma(j)\}$. Thus, $H_{\gamma;ij}$ has the same number of nodes as
$G$. We first make the following simple observation: the $(i,j)$ element in the $\gamma$-th power of the
walk matrix, $R_G^\gamma(i,j)$, is given by walks of length $\gamma$ between $i$ and $j$ on graph $G$ and thus
depends only on the subgraph$^{17}$ $H_{\gamma;ij}$ (see (35)). This enables us to quantify the effect of nodes
outside $B_\gamma(i) \cap B_\gamma(j)$ on the covariance $\Sigma_G(i,j)$.
Define a new walk matrix $R_{H_{\gamma;ij}}$ such that

$$R_{H_{\gamma;ij}}(a,b) = \begin{cases} R_G(a,b), & a,b \in B_\gamma(i) \cap B_\gamma(j), \qquad (39) \\ 0, & \text{o.w.} \qquad (40) \end{cases}$$

In other words, $R_{H_{\gamma;ij}}$ is formed by considering the Gaussian graphical model over graph
$H_{\gamma;ij}$. Let $\Sigma_{H_{\gamma;ij}}$ denote the corresponding covariance matrix.$^{18}$

Lemma 10 (Covariance Bounds Under Walk-summability) For any walk-summable
Gaussian graphical model ($\alpha := \|\overline{R}_G\| < 1$), we have$^{19}$

$$\max_{i,j} \left|\Sigma_G(i,j) - \Sigma_{H_{\gamma;ij}}(i,j)\right| \le \alpha^\gamma \frac{2\alpha}{1 - \alpha} = O(\alpha^\gamma). \qquad (41)$$

Thus, for walk-summable Gaussian graphical models, we have $\alpha := \|\overline{R}_G\| < 1$, implying
that the error in (41) in approximating the covariance by the local neighborhood decays
exponentially with distance. Parts of the proof below are inspired by Dumitriu and Pal.

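Proof: Using the power series in (34), we can write the covariance matrix as

$$\Sigma_G = \sum_{k=0}^{\gamma} R_G^k + E_G,$$

where the error matrix $E_G$ has spectral radius

$$\|E_G\| \le \frac{\|R_G\|^{\gamma+1}}{1 - \|R_G\|}$$

from (34). Thus,$^{20}$ for any $i,j \in V$,

$$\left|\Sigma_G(i,j) - \sum_{k=0}^{\gamma} R_G^k(i,j)\right| \le \frac{\|R_G\|^{\gamma+1}}{1 - \|R_G\|}. \qquad (42)$$

Similarly, we have

$$\left|\Sigma_{H_{\gamma;ij}}(i,j) - \sum_{k=0}^{\gamma} R_{H_{\gamma;ij}}^k(i,j)\right| \le \frac{\|R_{H_{\gamma;ij}}\|^{\gamma+1}}{1 - \|R_{H_{\gamma;ij}}\|} \qquad (43)$$

$$\overset{(a)}{\le} \frac{\|\overline{R}_G\|^{\gamma+1}}{1 - \|\overline{R}_G\|}, \qquad (44)$$

where for inequality (a), we use the fact that

$$\|R_{H_{\gamma;ij}}\| \le \|\overline{R}_{H_{\gamma;ij}}\| \le \|\overline{R}_G\|,$$

since $H_{\gamma;ij}$ is a subgraph$^{21}$ of $G$.

Since $\|R_G\| \le \|\overline{R}_G\| = \alpha$ and $R_G^k(i,j) = R_{H_{\gamma;ij}}^k(i,j)$ for all $k \le \gamma$ (walks of length at most $\gamma$
between $i$ and $j$ stay inside $B_\gamma(i) \cap B_\gamma(j)$), combining (42) and (44), using the triangle
inequality, we obtain (41). □

17. Note that $R_G^\gamma(i,j) = 0$ if $B_\gamma(i) \cap B_\gamma(j) = \emptyset$.

18. When $B_\gamma(i) \cap B_\gamma(j) = \emptyset$, meaning that graph distance between $i$ and $j$ is more than $\gamma$, we obtain [...]

19. The bound in (41) also holds if $H_{\gamma;ij}$ is replaced with any of its supergraphs.

20. For any matrix $A$, we have $\max_{i,j} |A(i,j)| \le \|A\|$.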

We also make some simple observations about conditional covariances in walk-summable
models. Recall that $\overline{R}_G$ denotes the matrix with absolute values of $R_G$, and $R_G$ is the walk
matrix over graph $G$. Also recall that the $\alpha$-walk-summability condition in (12) is
$\|\overline{R}_G\| \le \alpha < 1$.



Proposition 11 (Conditional Covariances under Walk-Summability) Given a walk-summable
Gaussian graphical model, for any $i,j \in V$ and $S \subset V$ with $i,j \notin S$, we have

$$\Sigma(i,j \mid S) = \sum_{\substack{w:\, i \to j \\ w \cap S = \emptyset}} \phi_G(w). \qquad (45)$$

Moreover, we have

$$\sup_{S \subset V \setminus i} \Sigma(i,i \mid S) \le (1 - \alpha)^{-1} = O(1). \qquad (46)$$

Proof: We have, from Rue and Held (2005, Thm. 2.5),

$$\Sigma(i,j \mid S) = J^{-1}_{-S,-S;G}(i,j),$$

where $J_{-S,-S;G}$ denotes the submatrix of potential matrix $J_G$ obtained by deleting nodes in $S$. Since
a submatrix of a walk-summable matrix is walk-summable, we have (45) by appealing to the
walk-sum expression for conditional covariances.

For (46), let $\|A\|_\infty$ denote the maximum absolute value of entries in matrix $A$. Using
monotonicity of spectral norm and the fact that $\|A\|_\infty \le \|A\|$, we have

$$\sup_{S \subset V,\, i \in V} \Sigma(i,i \mid S) \le \left\|J^{-1}_{-S,-S;G}\right\| \le \left(1 - \|R_{-S,-S;G}\|\right)^{-1} \le \left(1 - \|\overline{R}_{-S,-S;G}\|\right)^{-1} \le \left(1 - \|\overline{R}_G\|\right)^{-1} = O(1).$$

□

Thus, the conditional covariance in (j45p consists of walks in the original graph G, not 
passing through nodes in S. 



21. When two matrices A and B are such that |yl(i, j)| > \B{i,j)\ for all i,j, we have ||A|| > ||B| 
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Appendix B. Graphs with Local-Separation Property 

B.l Conditional Covariance between Non-Neighbors: Normalized Case 

We now provide bounds on the conditional covariance for Gaussian graphical models Markov on a graph $G \sim \mathcal{G}(p;\eta,\gamma)$ satisfying the local-separation property $(\eta,\gamma)$, as per Definition 2.

Lemma 12 (Conditional Covariance Between Non-neighbors) For a walk-summable Gaussian graphical model, the conditional covariance between non-neighbors $i$ and $j$, conditioned on $S_\gamma$, the $\gamma$-local separator between $i$ and $j$, satisfies

$$\max \Sigma(i,j|S_\gamma) = O(\|\bar{R}_G\|^\gamma). \qquad (47)$$

Proof: In this proof, we abbreviate $S_\gamma$ by $S$ for notational convenience. The conditional covariance is given by the Schur complement, i.e., for any subset $A$ such that $A \cap S = \emptyset$,

$$\Sigma(A|S) = \Sigma(A,A) - \Sigma(A,S)\Sigma(S,S)^{-1}\Sigma(S,A). \qquad (48)$$

We use the notation $\Sigma_G(A,A)$ to denote the submatrix of the covariance matrix $\Sigma_G$, when the underlying graph is $G$. As in Lemma 10, we may decompose $\Sigma_G$ as follows:

$$\Sigma_G = \Sigma_{H_\gamma} + E_\gamma,$$

where $H_\gamma$ is the subgraph spanned by the $\gamma$-hop neighborhood $B_\gamma(i)$, and $E_\gamma$ is the error matrix. Let $F_\gamma$ be the matrix such that

$$\Sigma_G(S,S)^{-1} = \Sigma_{H_\gamma}(S,S)^{-1} + F_\gamma.$$

We have $\Sigma_{H_\gamma}(i,j|S) = 0$, where $\Sigma_{H_\gamma}(i,j|S)$ denotes the conditional covariance by considering the model given by the subgraph $H_\gamma$. This is due to the Markov property since $i$ and $j$ are separated by $S$ in the subgraph $H_\gamma$.

Thus using (48), the conditional covariance on graph $G$ can be bounded as

$$\Sigma_G(i,j|S) = O(\max(\|E_\gamma\|, \|F_\gamma\|)).$$

By Lemma 10, we have $\|E_\gamma\| = O(\|\bar{R}_G\|^\gamma)$. Using the Woodbury matrix-inversion identity, we also have $\|F_\gamma\| = O(\|\bar{R}_G\|^\gamma)$. □

B.2 Extension to General Precision Matrices: Unnormalized Case 

We now extend the above analysis to general precision matrices J where the diagonal 
elements are not assumed to be identity. Denote the precision matrix as 

J = D - E, 

where $D$ is a diagonal matrix and $E$ has zero diagonal elements. We thus have that

$$J_{\mathrm{norm}} := D^{-0.5} J D^{-0.5} = I - R, \qquad (49)$$

where $R$ is the partial correlation matrix. This also implies that

$$J = D^{0.5} J_{\mathrm{norm}} D^{0.5}.$$

Thus, we have that

$$\Sigma = D^{-0.5} \Sigma_{\mathrm{norm}} D^{-0.5}, \qquad (50)$$

where $\Sigma_{\mathrm{norm}} := J_{\mathrm{norm}}^{-1}$ is the covariance matrix corresponding to the normalized model. When the model is walk-summable, i.e., $\|\bar{R}\| \le \alpha < 1$, we have that $\Sigma_{\mathrm{norm}} = \sum_{k \ge 0} R^k$.

We now utilize the results derived in the previous sections involving the normalized model (Lemma 10 and Lemma 12) to obtain bounds for general precision matrices.
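The normalization step can be illustrated numerically. The sketch below assumes a hypothetical 3-node chain with a non-unit diagonal (the matrix entries are illustrative only). It forms $J_{\mathrm{norm}} = D^{-0.5} J D^{-0.5}$, sums the series $\sum_{k \ge 0} R^k$ for the normalized covariance, and rescales by $D^{-0.5}$ on both sides to recover the covariance of the original model.

```python
import numpy as np

# Hypothetical precision matrix with non-unit diagonal (3-node chain).
J = np.array([[2.0, -0.5, 0.0],
              [-0.5, 1.0, -0.3],
              [0.0, -0.3, 4.0]])

D_inv_sqrt = np.diag(np.diag(J) ** -0.5)
J_norm = D_inv_sqrt @ J @ D_inv_sqrt     # unit diagonal by construction
R = np.eye(3) - J_norm                   # partial correlation matrix

# Walk-summability parameter: spectral norm of the entrywise |R|.
alpha = np.linalg.norm(np.abs(R), 2)

# Truncated series sum_{k>=0} R^k, which converges because alpha < 1.
Sigma_norm = np.zeros((3, 3))
term = np.eye(3)
for _ in range(200):
    Sigma_norm += term
    term = term @ R

# Undo the normalization to get the covariance of the original model.
Sigma = D_inv_sqrt @ Sigma_norm @ D_inv_sqrt
err = np.max(np.abs(Sigma - np.linalg.inv(J)))
```

The truncation length of 200 terms is an arbitrary choice; with `alpha` well below 1 the remaining tail is negligible, so `err` should be at floating-point noise level.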

Lemma 13 (Covariance Bounds for General Models) For any walk-summable Gaussian graphical model ($\alpha := \|\bar{R}_G\| < 1$), we have the following results:

1. Covariance Bounds: The covariance entries upon limiting to a subgraph $H_{\gamma;ij}$ for any $i,j \in V$ satisfy

$$\max_{i,j} |\Sigma_G(i,j) - \Sigma_{H_{\gamma;ij}}(i,j)| \le \frac{2\alpha^{\gamma+1}}{D_{\min}(1-\alpha)} = O\left(\frac{\alpha^\gamma}{D_{\min}}\right), \qquad (51)$$

where $D_{\min} := \min_i D(i,i) = \min_i J(i,i)$.

2. Conditional Covariance between Non-neighbors: The conditional covariance between non-neighbors $i$ and $j$, conditioned on $S_\gamma$, the $\gamma$-local separator between $i$ and $j$, satisfies

$$\max \Sigma(i,j|S_\gamma) = O\left(\frac{\alpha^\gamma}{D_{\min}}\right), \qquad (52)$$

where $D_{\min} := \min_i D(i,i) = \min_i J(i,i)$.

Proof: Using (50) and Lemma 10, we have (51). Similarly, it can be shown that for any $S \subset V \setminus \{i,j\}$, $i,j \in V$,

$$\Sigma(i,j|S) = D^{-0.5} \Sigma_{\mathrm{norm}}(i,j|S) D^{-0.5},$$

where $\Sigma_{\mathrm{norm}}(i,j|S)$ is the conditional covariance corresponding to the model with normalized precision matrix. From Lemma 12, we have (52). □

B.3 Conditional Covariance between Neighbors: General Case 

We provide a lower bound on conditional covariance among the neighbors for the graphs under consideration. Recall that $J_{\min}$ denotes the minimum edge potential. Let

$$K(i,j) := \|J(V \setminus \{i,j\}, \{i,j\})\|^2,$$

where $J(V \setminus \{i,j\}, \{i,j\})$ is a sub-matrix of the potential matrix $J$.

Lemma 14 (Conditional Covariance Between Neighbors) For an $\alpha$-walk-summable Gaussian graphical model satisfying

$$D_{\min}(1-\alpha) \min_{(i,j) \in G} \frac{J_{\min}}{K(i,j)} > 1 + \delta, \qquad (53)$$

for some $\delta > 0$ (not depending on $p$), where $D_{\min} := \min_i J(i,i)$, we have

$$|\Sigma_G(i,j|S)| = \Omega(J_{\min}), \qquad (54)$$

for any $(i,j) \in G$ such that $j \in \mathcal{N}(i)$ and any subset $S \subset V$ with $i,j \notin S$.
Proof: First note that for attractive models,

$$\Sigma_G(i,j|S) \overset{(a)}{\ge} \Sigma_{G_1}(i,j|S) \overset{(b)}{=} \frac{-J(i,j)}{J(i,i)J(j,j) - J(i,j)^2}, \qquad (55)$$

where $G_1$ is the graph consisting only of the edge $(i,j)$. Inequality (a) arises from the fact that in attractive models, the weights of all the walks are positive, and thus, the weight of walks on $G_1$ forms a lower bound for those on $G$ (recall that the covariances are given by the sum-weight of walks on the graphs). Equality (b) is by direct matrix inversion of the model on $G_1$.

For general models, we need further analysis. Let $A = \{i,j\}$ and $B = V \setminus \{S \cup A\}$, for some $S \subset V \setminus A$. Let $\Sigma(A,A)$ denote the covariance matrix on set $A$, and let $\tilde{J}(A,A) := \Sigma(A,A)^{-1}$ denote the corresponding marginal potential matrix. We have for all $S \subset V \setminus A$,

$$\tilde{J}(A,A) = J(A,A) - J(A,B)J(B,B)^{-1}J(B,A).$$

Recall that $\|A\|_\infty$ denotes the maximum absolute value of entries in matrix $A$. We have

$$\|J(A,B)J(B,B)^{-1}J(B,A)\|_\infty \overset{(a)}{\le} \|J(A,B)J(B,B)^{-1}J(B,A)\|$$

$$\overset{(b)}{\le} \|J(A,B)\|^2 \|J(B,B)^{-1}\| = \frac{\|J(A,B)\|^2}{\lambda_{\min}(J(B,B))} \qquad (56)$$

$$\overset{(c)}{\le} \frac{K(i,j)}{D_{\min}(1-\alpha)}, \qquad (57)$$

where inequality (a) arises from the fact that the $\ell_\infty$ norm is bounded by the spectral norm, (b) arises from the sub-multiplicative property of norms, and (c) arises from the bound on edge potentials and the $\alpha$-walk-summability of the model, since $K(i,j) \ge \|J(A,B)\|^2$. Assuming (53), we have

$$|\tilde{J}(i,j)| \ge J_{\min} - \frac{K(i,j)}{D_{\min}(1-\alpha)} = \Omega(J_{\min}).$$

Since

$$\Sigma_G(i,j|S) = \frac{-\tilde{J}(i,j)}{\tilde{J}(i,i)\tilde{J}(j,j) - \tilde{J}(i,j)^2},$$

we have the result. □

B.4 Analysis of Loopy Belief Propagation

Proof of the proposition on loopy belief propagation: From Lemma 10 in Section A.3, for any $\alpha$-walk-summable Gaussian graphical model, we have, for all nodes $i \in V$ conditioned on the event $\Gamma(i;\gamma,G)$,

$$|\Sigma_G(i,i) - \Sigma_{\mathrm{LBP}}(i,i)| = O(\|\bar{R}_G\|^\gamma). \qquad (58)$$

This is because conditioned on $\Gamma(i;\gamma,G)$, it is shown that the series expansions based on walk-sums corresponding to the variances $\Sigma_{H_{\gamma;ii}}(i,i)$ and $\Sigma_{\mathrm{LBP}}(i,i)$ are identical up to length-$\gamma$ walks, and the effect of walks beyond length $\gamma$ can be bounded as above. Moreover, for a sequence of $\alpha$-walk-summable models, we have $\Sigma(i,i) \le M$ for all $i \in V$, for some constant $M$, and similarly $\Sigma_{\mathrm{LBP}}(i,i) \le M'$ for some constant $M'$ since it is obtained by the set of self-avoiding walks in $G$. We thus have

$$E[|\Sigma_G(i,i) - \Sigma_{\mathrm{LBP}}(i,i)|] \le \left[O(\|\bar{R}_G\|^\gamma) + P[\Gamma^c(i;\gamma,G)]\right] = o(1),$$

where the expectation $E$ is over the ensemble $\mathcal{G}(p)$. By Markov's inequality (footnote 22), we have the result. □

Appendix C. Sample-based Analysis 
C.l Concentration of Empirical Quantities 



For our sample complexity analysis, we recap the concentration result by Ravikumar et al. (2008, Lemma 1) for sub-Gaussian matrices and specialize it to Gaussian matrices.

Lemma 15 (Concentration of Empirical Covariances) For any p-dimensional Gaussian random vector $X = [X_1, \ldots, X_p]$, the empirical covariance obtained from $n$ samples satisfies

$$P\left[|\hat\Sigma(i,j) - \Sigma(i,j)| > \epsilon\right] \le 4\exp\left[-\frac{n\epsilon^2}{3200 M^2}\right], \qquad (59)$$

for all $\epsilon \in (0, 40M)$ and $M := \max_i \Sigma(i,i)$.

This translates to bounds for empirical conditional covariance.

Corollary 16 (Concentration of Empirical Conditional Covariance) For a walk-summable p-dimensional Gaussian random vector $X = [X_1, \ldots, X_p]$, we have

$$P\left[\max_{i \ne j,\ S \subset V,\ |S| \le \eta} |\hat\Sigma(i,j|S) - \Sigma(i,j|S)| > \epsilon\right] \le 4p^{\eta+2}\exp\left(-\frac{n\epsilon^2}{K}\right), \qquad (60)$$

where $K \in (0,\infty)$ is a constant which is bounded when $\|\Sigma\|_\infty$ is bounded, for all $\epsilon \in (0, 40M)$ with $M := \max_i \Sigma(i,i)$, and $n \ge \eta$.
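To see what the entrywise bound of Lemma 15 implies for sample size, the sketch below inverts the tail bound $4\exp(-n\epsilon^2/(3200M^2))$ together with a union bound over all $p^2$ entries. The function names and the target failure probability `delta` are illustrative assumptions, not notation from the paper.

```python
import math

def tail_bound(n, eps, M):
    """Right-hand side of the entrywise bound: 4 exp(-n eps^2 / (3200 M^2))."""
    if not 0 < eps < 40 * M:
        raise ValueError("the bound is stated for eps in (0, 40 M)")
    return 4.0 * math.exp(-n * eps ** 2 / (3200.0 * M ** 2))

def samples_needed(eps, M, delta, p=1):
    """Smallest n for which p^2 * tail_bound(n, eps, M) <= delta (union bound)."""
    return math.ceil(3200.0 * M ** 2 * math.log(4.0 * p * p / delta) / eps ** 2)

# Hypothetical numbers: 10 variables, unit variances, accuracy 0.5.
n_req = samples_needed(eps=0.5, M=1.0, delta=0.05, p=10)
union_failure = 10 * 10 * tail_bound(n_req, 0.5, 1.0)
```

The required `n` grows as $\log p / \epsilon^2$, which is the scaling that drives the $n = \Omega(J_{\min}^{-2} \log p)$ sample complexity once $\epsilon$ is taken proportional to $J_{\min}$.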



22. By Markov's inequality, for a non-negative random variable $X$, we have $P[X > \delta] \le E[X]/\delta$. By choosing $\delta = \omega(E[X])$, we have the result.

Proof: For a given $i,j \in V$ and $S \subset V$ with $\eta \le n$, using (48),

$$P\left[|\hat\Sigma(i,j|S) - \Sigma(i,j|S)| > \epsilon\right] \le P\left[\left(|\hat\Sigma(i,j) - \Sigma(i,j)| > \epsilon\right) \cup \bigcup_{k \in S} \left(|\hat\Sigma(i,k) - \Sigma(i,k)| > K'\epsilon\right)\right],$$

where $K'$ is a constant which is bounded when $\|\Sigma\|_\infty$ is bounded. Using Lemma 15, we have the result. □



C.2 Proof of Theorem H 

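We are now ready to prove the theorem. We analyze the error events for the conditional covariance threshold test CCT. For any $(i,j) \notin G_p$, define the event

$$\mathcal{F}_1(i,j;\{x^n\},G_p) := \{|\hat\Sigma(i,j|S)| > \xi_{n,p}\}, \qquad (61)$$

where $\xi_{n,p}$ is the threshold in (15) and $S$ is the $\gamma$-local separator between $i$ and $j$ (since the minimum in the CCT test is achieved by the $\gamma$-local separator). Similarly, for any edge $(i,j) \in G_p$, define the event

$$\mathcal{F}_2(i,j;\{x^n\},G_p) := \{\exists S \subset V : |S| \le \eta,\ |\hat\Sigma(i,j|S)| < \xi_{n,p}\}. \qquad (62)$$

The probability of error resulting from CCT can thus be bounded by the two types of errors,

$$P[\mathrm{CCT}(\{x^n\};\xi_{n,p}) \ne G_p] \le P\left[\bigcup_{(i,j) \in G_p} \mathcal{F}_2(i,j;\{x^n\},G_p)\right] + P\left[\bigcup_{(i,j) \notin G_p} \mathcal{F}_1(i,j;\{x^n\},G_p)\right]. \qquad (63)$$

For the first term, applying the union bound and using the result (60) of Corollary 16,

$$P\left[\bigcup_{(i,j) \in G_p} \mathcal{F}_2(i,j;\{x^n\},G_p)\right] = O\left(p^{\eta+2}\exp\left[-\frac{n(C_{\min}(p) - \xi_{n,p})^2}{K}\right]\right), \qquad (64)$$

where

$$C_{\min}(p) := \inf_{(i,j) \in G_p,\ S \subset V,\ i,j \notin S,\ |S| \le \eta} |\Sigma(i,j|S)| = \Omega(J_{\min}), \quad \forall p \in \mathbb{N}, \qquad (65)$$

from (69). Since $\xi_{n,p} = o(J_{\min})$, (64) is $o(1)$ when $n > L \log p / J_{\min}^2$, for sufficiently large $L$ (depending on $\eta$ and $M$). For the second term in (63),

$$P\left[\bigcup_{(i,j) \notin G_p} \mathcal{F}_1(i,j;\{x^n\},G_p)\right] = O\left(p^{\eta+2}\exp\left[-\frac{n(\xi_{n,p} - C_{\max}(p))^2}{K}\right]\right), \qquad (66)$$

where

$$C_{\max}(p) := \max_{(i,j) \notin G_p} |\Sigma(i,j|S)| = O\left(\frac{\alpha^\gamma}{D_{\min}}\right), \qquad (67)$$

from (68). For the choice of $\xi_{n,p}$ in (15), (66) is $o(1)$ and this completes the proof of the theorem. □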
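The test analyzed above can be sketched in a few lines. This is a minimal illustration, assuming the test declares an edge when the minimum of $|\hat\Sigma(i,j|S)|$ over all sets $S$ with $|S| \le \eta$ exceeds the threshold, which is how the events (61) and (62) are phrased. The function names `cond_cov` and `cct`, and the 4-node chain used as input, are hypothetical. The exhaustive search over separators costs on the order of $p^{\eta+2}$ conditional covariance evaluations.

```python
from itertools import combinations
import numpy as np

def cond_cov(Sigma, i, j, S):
    """Conditional covariance Sigma(i,j|S) via the Schur complement."""
    S = list(S)
    if not S:
        return Sigma[i, j]
    correction = Sigma[i, S] @ np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[S, j])
    return Sigma[i, j] - correction

def cct(Sigma_hat, eta, xi):
    """Keep (i, j) as an edge if min over |S| <= eta of |Sigma_hat(i,j|S)| > xi."""
    p = Sigma_hat.shape[0]
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        stat = min(abs(cond_cov(Sigma_hat, i, j, S))
                   for size in range(eta + 1)
                   for S in combinations(others, size))
        if stat > xi:
            edges.add((i, j))
    return edges

# Hypothetical 4-node chain 0-1-2-3 with J = I - R; exact covariance as input.
R = np.zeros((4, 4))
for k in range(3):
    R[k, k + 1] = R[k + 1, k] = 0.3
Sigma = np.linalg.inv(np.eye(4) - R)
recovered = cct(Sigma, eta=1, xi=0.05)
```

On this chain every non-neighbor pair is separated by a single node, so `eta=1` suffices; with sample covariances in place of `Sigma`, the threshold `xi` would need to sit between the two error scales discussed in the proof.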

C.3 Conditional Mutual Information Thresholding Test 

We now analyze the performance of the conditional mutual information threshold test. We first note bounds on conditional mutual information.

Proposition 17 (Conditional Mutual Information) Under the assumptions (A1)-(A5), we have that the conditional mutual information among non-neighbors, conditioned on the $\gamma$-local separator, satisfies

$$\max I(X_i; X_j | X_{S_\gamma}) = O(\alpha^{2\gamma}), \qquad (68)$$

and the conditional mutual information among the neighbors satisfies

$$\min_{S \subset V \setminus \{i,j\}} I(X_i; X_j | X_S) = \Omega(J_{\min}^2). \qquad (69)$$

Proof: The conditional mutual information for Gaussian variables is given by

$$I(X_i; X_j | X_S) = -\frac{1}{2}\log[1 - \rho^2(i,j|S)], \qquad (70)$$

where $\rho(i,j|S)$ is the conditional correlation coefficient, given by

$$\rho(i,j|S) := \frac{\Sigma(i,j|S)}{\sqrt{\Sigma(i,i|S)\Sigma(j,j|S)}}.$$

From (46) in Proposition 11 we have $\Sigma(i,i|S) = O(1)$ and thus, the result holds. □
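The relation between conditional covariance and conditional mutual information is easy to evaluate directly. The sketch below assumes the three conditional covariance entries are already available (for example from a Schur complement); the function name is illustrative.

```python
import math

def conditional_mutual_information(s_ij, s_ii, s_jj):
    """I(X_i; X_j | X_S) = -0.5 * log(1 - rho^2), in nats, for Gaussian variables."""
    rho = s_ij / math.sqrt(s_ii * s_jj)   # conditional correlation coefficient
    return -0.5 * math.log(1.0 - rho * rho)

# For small correlations, I is approximately rho^2 / 2.
small = conditional_mutual_information(0.01, 1.0, 1.0)
moderate = conditional_mutual_information(0.5, 1.0, 1.0)
```

Because $-\frac{1}{2}\log(1-\rho^2) \approx \rho^2/2$ for small $\rho$ and the conditional variances are $O(1)$, bounds on the conditional covariance carry over to the mutual information with the exponent doubled, which is why the orders in (68) and (69) are squares of the corresponding covariance orders.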

We now note the concentration bounds on empirical mutual information. 

Lemma 18 (Concentration of Empirical Mutual Information) For any p- dimensional 
Gaussian random vector X = [Xi, . . . ,Xp], the empirical covariance obtained from n sam- 
ples satisfies 

P{\m;X,)-I{Xf,Xj)\ > e) < 24exp (^-^^M^) , (71) 

for some constant $L$ which is finite when $\rho_{\max} := \max_{i \ne j} |\rho(i,j)| < 1$, and all $\epsilon < \rho_{\max}$, and for $M := \max_i \Sigma(i,i)$.

Proof: The result on empirical covariances can be found in (Ravikumar et al., 2008, Lemma 1). The result in (71) will be shown through a sequence of transformations. First, we will bound $P(|\hat\rho(i,j) - \rho(i,j)| > \epsilon)$. Consider,

P{\p{iJ)-p{iJ)\>e) 
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P 



S(i,i) 



S(i,j) 



:s(i,i)s(j,i))i/2 (s(.,i)s(i,j))i/2 

1/2 

- 1 



> e 



^ihj) / 5](z,i)S(j,j) 



S(i,j) lE(i,i)S(j,i) 



> 



WJ)\ 



(a) /S(i,i) 



|p(i,i)l 



1/3^ 



^^tey<" 



\p{iJ)\ 



1/3^ 



/S(i,i) 
+ P[ .A^> 1 + 
ls(i,i) 



|P(^,J)I 



''].p(m<n 



+p(SM>fi+ ^ 



(b) ft(i,j) 



ip(^,j)i 



2/3^ 



S(i,i) 



+ p[SM<ii 



,5](j,j) 



8|p(i,i)| 






8|p(^,j)| 



S(i,i) 3|p(z,j)|_ 



+p(M4<i 



(c) 

< 24exp 



204800|/)(i,i) 



(d) 

2 , < 24exp 



S(i,j) 

nMe^ \ 
~ 204800 J 



3|p(^,j)| 



3|p(i,i)| 



e 

+ ... 
+ .. 



2/3\ 



+ 



2/3^ 



where in (a), we used the fact that $P(ABC > 1+\delta) \le P(A > (1+\delta)^{1/3} \text{ or } B > (1+\delta)^{1/3} \text{ or } C > (1+\delta)^{1/3})$ and the union bound, in (b) we used the fact that $(1+\delta)^3 \le 1 + 8\delta$ and $(1+\delta)^{-2/3} \le 1 - \delta/3$ for $\delta = \epsilon/|\rho(i,j)| < 1$. Finally, in (c), we used the result in (59) and in (d), we used the bounds on $\rho < 1$.

Now, define the bijective function $I(|\rho|) := -\frac{1}{2}\log(1-\rho^2)$. Then we claim that there exists a constant $L \in (0,\infty)$, depending only on $\rho_{\max} < 1$, such that

$$|I(x) - I(y)| \le L|x - y|, \qquad (72)$$

i.e., the function $I : [0,\rho_{\max}] \to \mathbb{R}^+$ is $L = L(\rho_{\max})$-Lipschitz. This is because the slope of the function $I$ is bounded in the interval $[0,\rho_{\max}]$. Thus, we have the inclusion

$$\{|\hat{I}(X_i;X_j) - I(X_i;X_j)| > \epsilon\} \subset \{|\hat\rho(i,j) - \rho(i,j)| > \epsilon/L\} \qquad (73)$$

since if $|\hat{I}(X_i;X_j) - I(X_i;X_j)| > \epsilon$ it is true that $L|\hat\rho(i,j) - \rho(i,j)| > \epsilon$ from (72). We have by monotonicity of measure and (73) the desired result. □
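The Lipschitz constant in (72) can be made explicit. The short derivation below is a worked illustration rather than part of the original proof; it assumes only the definition $I(\rho) = -\frac{1}{2}\log(1-\rho^2)$ on $[0,\rho_{\max}]$ with $\rho_{\max} < 1$.

```latex
\begin{align*}
  I'(\rho) &= \frac{d}{d\rho}\left[-\tfrac{1}{2}\log(1-\rho^2)\right]
            = \frac{\rho}{1-\rho^2}, \\
  I''(\rho) &= \frac{1+\rho^2}{(1-\rho^2)^2} > 0
            \quad\Longrightarrow\quad I' \text{ is increasing on } [0,1), \\
  L(\rho_{\max}) &= \sup_{\rho \in [0,\rho_{\max}]} I'(\rho)
            = \frac{\rho_{\max}}{1-\rho_{\max}^2}.
\end{align*}
```

By the mean value theorem, $|I(x) - I(y)| \le L(\rho_{\max})|x-y|$ for all $x,y \in [0,\rho_{\max}]$. The constant is finite exactly when $\rho_{\max} < 1$ and blows up as $\rho_{\max} \to 1$.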

We can now obtain the desired result on concentration of empirical conditional mutual information.

Lemma 19 (Concentration of Empirical Conditional Mutual Information) For a walk-summable p-dimensional Gaussian random vector $X = [X_1, \ldots, X_p]$, we have



max \I{Xf, X,\Xs) - I{X^;Xj\Xs)\ > e 

.scv\{i,j},\s\<ii 



^-"'^^-(-^S)''^^' 



for constants $M, L \in (0,\infty)$ and all $\epsilon < \rho_{\max}$, where $\rho_{\max} := \max_{i \ne j,\ S \subset V \setminus \{i,j\},\ |S| \le \eta} |\rho(i,j|S)|$.

Proof: Since the model is walk-summable, we have that $\max_{i,S} \Sigma(i,i|S) = O(1)$ and thus, the constant $M$ is bounded. Similarly, due to strict positive-definiteness we have $\rho_{\max} < 1$ even as $p \to \infty$, and thus, the constant $L$ is also finite. The result then follows from the union bound. □

The sample complexity for structural consistency of CMIT follows along the lines of the analysis for CCT.

Appendix D. Necessary Conditions for Model Selection 

D.l Necessary Conditions for Exact Recovery 

We provide the proof of Theorem 6 in this section. We collect four auxiliary lemmata whose proofs (together with the proof of Lemma 8) will be provided at the end of the section. For information-theoretic notation, the reader is referred to Cover and Thomas (2006).

Lemma 20 (Upper Bound on Differential Entropy of Mixture) Let $\alpha < 1$. Suppose asymptotically almost surely each precision matrix $J_G = I - R_G$ satisfies (12), i.e., that $\|\bar{R}_G\| \le \alpha$ for a.e. $G \in \mathcal{G}(p)$. Then, for the Gaussian model, we have

$$h(X^n) \le \frac{pn}{2}\log_2\left(\frac{2\pi e}{1-\alpha}\right), \qquad (75)$$

where recall that $X^n | G \sim \prod_{i=1}^n f(x_i | G)$.

For the sake of convenience, we define the random variable

$$W := \begin{cases} 1 & G \in \mathcal{T}_\epsilon^{(p)}, \\ 0 & G \notin \mathcal{T}_\epsilon^{(p)}. \end{cases} \qquad (76)$$

The random variable $W$ indicates whether $G \in \mathcal{T}_\epsilon^{(p)}$.

Lemma 21 (Lower Bound on Conditional Differential Entropy) Suppose that each precision matrix $J_G$ has unit diagonal. Then,

$$h(X^n | G, W) \ge -\frac{pn}{2}\log_2(2\pi e). \qquad (77)$$

Lemma 22 (Conditional Fano Inequality) In the above notation, we have

$$\frac{H(G | X^n, G \in \mathcal{T}_\epsilon^{(p)}) - 1}{\log_2(|\mathcal{T}_\epsilon^{(p)}| - 1)} \le P(\hat{G}(X^n) \ne G \mid G \in \mathcal{T}_\epsilon^{(p)}). \qquad (78)$$

Lemma 23 (Exponential Decay in Probability of Atypical Set) Define the rate function $K(c,\epsilon) := \frac{c}{2}[(1+\epsilon)\ln(1+\epsilon) - \epsilon]$. The probability of the $\epsilon$-atypical set decays as

$$P((\mathcal{T}_\epsilon^{(p)})^c) = P(G \notin \mathcal{T}_\epsilon^{(p)}) \le 2\exp(-pK(c,\epsilon)) \qquad (79)$$

for all $p \ge 1$.

Note the non-asymptotic nature of the bound in (79). The rate function $K(c,\epsilon)$ satisfies $\lim_{\epsilon \downarrow 0} K(c,\epsilon)/\epsilon^2 = c/4$. We prove Theorem 6 using these lemmata.

Proof: Consider the following sequence of lower bounds:



$$\frac{pn}{2}\log_2\left(\frac{2\pi e}{1-\alpha}\right) \overset{(a)}{\ge} h(X^n)$$

$$\overset{(b)}{\ge} h(X^n | W)$$

$$= I(X^n; G | W) + h(X^n | G, W)$$

$$\overset{(c)}{\ge} I(X^n; G | W) - \frac{pn}{2}\log_2(2\pi e) \qquad (80)$$

$$= H(G|W) - H(G | X^n, W) - \frac{pn}{2}\log_2(2\pi e), \qquad (81)$$

where (a) follows from Lemma 20, (b) is because conditioning does not increase differential entropy and (c) follows from Lemma 21. We will lower bound the first term in (81) and upper bound the second term in (81). Now consider the first term in (81):

$$H(G|W) = H(G | W=1)P(W=1) + H(G | W=0)P(W=0)$$

$$\overset{(a)}{\ge} H(G | W=1)P(W=1)$$

$$\overset{(b)}{\ge} H(G \mid G \in \mathcal{T}_\epsilon^{(p)})(1-\epsilon)$$

$$\overset{(c)}{\ge} (1-\epsilon)\binom{p}{2} H_b\left(\frac{c}{p}\right), \qquad (82)$$

where (a) is because the entropy $H(G | W=0)$ and the probability $P(W=0)$ are both non-negative. Inequality (b) follows for all $p$ sufficiently large from the definition of $W$ as well as Lemma 8 part 1. Statement (c) comes from the fact that






ser."') 



y-'^ 



We are now done bounding the first term in the difference in (81).

Now we will bound the second term in (81). First we will derive a bound on $H(G | X^n, W=1)$. Consider,

$$P_e^{(p)} := P(\hat{G}(X^n) \ne G)$$

$$\overset{(a)}{=} P(\hat{G}(X^n) \ne G \mid W=1)P(W=1) + P(\hat{G}(X^n) \ne G \mid W=0)P(W=0)$$

$$\ge P(\hat{G}(X^n) \ne G \mid W=1)P(W=1)$$

$$\overset{(b)}{\ge} \frac{1}{1+\epsilon} P(\hat{G}(X^n) \ne G \mid G \in \mathcal{T}_\epsilon^{(p)})$$

$$\overset{(c)}{\ge} \frac{1}{1+\epsilon} \cdot \frac{H(G \mid X^n, G \in \mathcal{T}_\epsilon^{(p)}) - 1}{\log_2 |\mathcal{T}_\epsilon^{(p)}|}, \qquad (83)$$

where (a) is by the law of total probability, (b) holds for all $p$ sufficiently large by Lemma 8 part 1 and (c) is due to the conditional version of Fano's inequality (Lemma 22). Then, from (83), we have

$$H(G \mid X^n, W=1) \le P_e^{(p)}(1+\epsilon)\log_2 |\mathcal{T}_\epsilon^{(p)}| + 1. \qquad (84)$$

Define the rate function $K(c,\epsilon) := \frac{c}{2}[(1+\epsilon)\ln(1+\epsilon) - \epsilon]$. Note that this function is positive whenever $c, \epsilon > 0$. In fact it is monotonically increasing in both parameters. Now we utilize (84) to bound $H(G | X^n, W)$:

$$H(G | X^n, W) = H(G | X^n, W=1)P(W=1) + H(G | X^n, W=0)P(W=0)$$

$$\overset{(a)}{\le} H(G | X^n, W=1) + H(G | X^n, W=0)P(W=0)$$

$$\overset{(b)}{\le} H(G | X^n, W=1) + H(G | X^n, W=0)\left(2e^{-pK(c,\epsilon)}\right)$$

$$\overset{(c)}{\le} H(G | X^n, W=1) + p^2\left(2e^{-pK(c,\epsilon)}\right)$$

$$\overset{(d)}{\le} P_e^{(p)}(1+\epsilon)\binom{p}{2}H_b\left(\frac{c}{p}\right) + 1 + 2p^2 e^{-pK(c,\epsilon)}, \qquad (85)$$

where (a) is because we upper bounded $P(W=1)$ by unity, (b) follows by Lemma 23, (c) follows by upper bounding the conditional entropy by $p^2$ and (d) follows from (84). Substituting (82) and (85) back into (81) yields



pn 



log2 



27re 



1 — a 



+ 1 



>-(^-<Hl 



which implies that 



n > 



plog2 



2^elT^ + l 



l)^^[p 



2/ Vp 



Pip)(l + 

'^ ^l-6)-PiP)(l + e 



-^j'^^^ 



1 - 2p^e 
G(p2e-P^(^'^)), 



-piC'(c,e) 



:i-e)-PiP)(l + e)l -e{pe 



,-pK(c,e)\ 



Note that $\Theta(p e^{-pK(c,\epsilon)}) \to 0$ as $p \to \infty$ since the rate function $K(c,\epsilon)$ is positive. If we impose that $P_e^{(p)} \to 0$ as $p \to \infty$, then $n$ has to satisfy (23) by the arbitrariness of $\epsilon > 0$. This completes the proof of Theorem 6. □

D.2 Proof of Lemma 8

Proof: Part 1 follows directly from the law of large numbers. Part 2 follows from the fact that the Binomial pmf is maximized at its mean. Hence, for $G \in \mathcal{T}_\epsilon^{(p)}$, we have



P(0) . ( j) 



cp/2 



P 



{-ycp/2 



We arrive at the upper bound after some rudimentary algebra. The lower bound can be 

(p) 
proved by observing that for G G Te , we have 

/„\ cp{l+e)/2 . X (P)_cp(l+e)/2 



= exp2 
> exp2 



n(^l0g2-)(l + 6) + [l-c(l + 6)/p]l0g2(l--: 

2/ p p p 

2)(^^°S2^)(l + e) + (l + 6)(l-^)l0g2(l-^) 



The result in Part 2 follows immediately by appealing to the symmetry of the binomial pmf 
about its mean. Part 3 follows by the following chain of inequalities: 



1 = Yl p(G) > Y. p{G) > Y 



exp2 



Ge«„ 



(p) 



GeV 



\T}^^\ 



G&V 



(p) 



PffJ-a + ^) 



p 



exp2 



(p)i 



This completes the proof of the upper bound on $|\mathcal{T}_\epsilon^{(p)}|$. The lower bound follows by noting that for sufficiently large $n$, $P(\mathcal{T}_\epsilon^{(p)}) \ge 1 - \epsilon$ (by Lemma 8 Part 1). Thus,



l-6< ^ P{G)< Y e^P2 

This completes the proof. 



>{l 



\T}P^ 



exp2 



>{l 



n 



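D.3 Proof of Lemma 20

Proof: Note that the distribution of $X$ (with $G$ marginalized out) is a Gaussian mixture model given by $\sum_{G \in \mathfrak{G}_p} P(G)\,\mathcal{N}(0, J_G^{-1})$. As such the covariance matrix of $X$ is given by

$$\Sigma_X = \sum_{G \in \mathfrak{G}_p} P(G) J_G^{-1}. \qquad (86)$$

This is not immediately obvious but it is due to the zero-mean nature of each Gaussian probability density function $\mathcal{N}(0, J_G^{-1})$. Using (86), we have the following chain of inequalities:

$$h(X^n) \le n\,h(X)$$

$$\overset{(a)}{\le} \frac{n}{2}\log_2\left((2\pi e)^p \det(\Sigma_X)\right)$$

$$= \frac{n}{2}\left[p\log_2(2\pi e) + \log_2\det(\Sigma_X)\right]$$

$$\overset{(b)}{\le} \frac{n}{2}\left[p\log_2(2\pi e) + p\log_2\lambda_{\max}(\Sigma_X)\right]$$

$$= \frac{n}{2}\left[p\log_2(2\pi e) + p\log_2\lambda_{\max}\left(\sum_{G \in \mathfrak{G}_p} P(G) J_G^{-1}\right)\right]$$

$$\overset{(c)}{\le} \frac{n}{2}\left[p\log_2(2\pi e) + p\log_2\left(\sum_{G \in \mathfrak{G}_p} P(G)\lambda_{\max}(J_G^{-1})\right)\right]$$

$$\overset{(d)}{\le} \frac{n}{2}\left[p\log_2(2\pi e) + p\log_2\left(\sum_{G \in \mathfrak{G}_p} P(G)\frac{1}{1-\alpha}\right)\right]$$

$$= \frac{n}{2}\left[p\log_2(2\pi e) + p\log_2\frac{1}{1-\alpha}\right]$$

$$= \frac{pn}{2}\log_2\left(\frac{2\pi e}{1-\alpha}\right),$$

where (a) uses the maximum entropy principle (Cover and Thomas, 2006, Chapter 13), i.e., that the Gaussian maximizes entropy subject to an average power constraint, (b) uses the fact that the determinant of $\Sigma_X$ is upper bounded by $\lambda_{\max}(\Sigma_X)^p$, (c) uses the convexity of $\lambda_{\max}(\cdot)$ (it equals the operator norm $\|\cdot\|_2$ over the set of symmetric matrices), and (d) uses the fact that $\alpha \ge \|\bar{R}_G\|_2 \ge \|R_G\|_2 = \|I - J_G\|_2 = \lambda_{\max}(I - J_G) = 1 - \lambda_{\min}(J_G)$ a.a.s. This completes the proof. □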

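D.4 Proof of Lemma 21

Proof: Firstly, we lower bound $h(X^n | G, W=1)$ as follows:

$$h(X^n | G) = \sum_{g \in \mathfrak{G}_p} P(g)\,h(X^n | G = g)$$

$$\overset{(a)}{=} n\sum_{g \in \mathfrak{G}_p} P(g)\,h(X | G = g)$$

$$\overset{(b)}{=} \frac{n}{2}\sum_{g \in \mathfrak{G}_p} P(g)\log_2\left[(2\pi e)^p \det(J_g^{-1})\right]$$

$$\ge -\frac{n}{2}\sum_{g \in \mathfrak{G}_p} P(g)\log_2\left[(2\pi e)^p \det(J_g)\right]$$

$$\overset{(c)}{\ge} -\frac{n}{2}\sum_{g \in \mathfrak{G}_p} P(g)\log_2\left[(2\pi e)^p\right]$$

$$= -\frac{pn}{2}\log_2(2\pi e),$$

where (a) is because the samples in $X^n$ are conditionally independent given $G = g$, (b) is by the Gaussian assumption, and (c) is by Hadamard's inequality

$$\det(J_g) \le \prod_{i=1}^{p} [J_g]_{ii} = 1 \qquad (87)$$

and the assumption that each diagonal element of each precision matrix $J_g$ is equal to 1 a.a.s. This proves the claim. □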



D.5 Proof of Lemma 22

Proof: Define the "error" random variable

$$E := \begin{cases} 1 & \hat{G}(X^n) \ne G, \\ 0 & \hat{G}(X^n) = G. \end{cases}$$

Now consider

$$H(E, G | X^n, W=1) = H(E | X^n, W=1) + H(G | E, X^n, W=1) \qquad (88)$$

$$= H(G | X^n, W=1) + H(E | G, X^n, W=1). \qquad (89)$$

The first term in (88) can be bounded above by 1 since the alphabet of the random variable $E$ is of size 2. Since $H(G | E=0, X^n, W=1) = 0$, the second term in (88) can be bounded from above as

$$H(G | E, X^n, W=1) = H(G | E=0, X^n, W=1)P(E=0 | W=1) + H(G | E=1, X^n, W=1)P(E=1 | W=1)$$

$$\le P(\hat{G}(X^n) \ne G \mid G \in \mathcal{T}_\epsilon^{(p)})\log_2(|\mathcal{T}_\epsilon^{(p)}| - 1).$$

The second term in (89) is 0. Hence, we have the desired conclusion. □



D.6 Proof of Lemma 23

Proof: The proof uses standard Chernoff bounding techniques but the scaling in $p$ is somewhat different from the usual Chernoff (Cramér) upper bound. For simplicity, we will use $M := \binom{p}{2}$. Let $Y_i$, $i = 1, \ldots, M$ be independent Bernoulli random variables such that $P(Y_i = 1) = c/p$. Then the probability in question can be bounded as

$$P(G \notin \mathcal{T}_\epsilon^{(p)}) = P\left(\left|\frac{1}{cp}\sum_{i=1}^{M} Y_i - \frac{1}{2}\right| > \frac{\epsilon}{2}\right)$$

$$\overset{(a)}{\le} 2P\left(\frac{1}{cp}\sum_{i=1}^{M} Y_i > \frac{1+\epsilon}{2}\right)$$

$$\overset{(b)}{\le} 2E\left[\exp\left(t\sum_{i=1}^{M} Y_i - pt\frac{c}{2}(1+\epsilon)\right)\right] \qquad (90)$$

$$= 2\exp\left(-pt\frac{c}{2}(1+\epsilon)\right)\prod_{i=1}^{M} E[\exp(tY_i)], \qquad (91)$$

where (a) follows from the union bound, and (b) follows from an application of Markov's inequality with $t > 0$ in (90). Now, the moment generating function of a Bernoulli random variable with probability of success $q$ is $qe^t + (1-q)$. Using this fact, we can further upper bound (91) as follows:

$$P(G \notin \mathcal{T}_\epsilon^{(p)}) \le 2\exp\left(-pt\frac{c}{2}(1+\epsilon) + M\ln\left(\frac{c}{p}e^t + \left(1 - \frac{c}{p}\right)\right)\right)$$

$$\overset{(a)}{\le} 2\exp\left(-pt\frac{c}{2}(1+\epsilon) + \frac{p(p-1)}{2}\cdot\frac{c}{p}(e^t - 1)\right)$$

$$\le 2\exp\left(-p\left[t\frac{c}{2}(1+\epsilon) - \frac{c}{2}(e^t - 1)\right]\right), \qquad (92)$$

where in (a), we used the fact that $\ln(1+z) \le z$. Now, we differentiate the exponent in square brackets with respect to $t > 0$ to find the tightest bound. We observe that the optimal parameter is $t^* = \ln(1+\epsilon)$. Substituting this back into (92) completes the proof.
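The optimization over $t$ and the small-$\epsilon$ behavior of the rate function can be checked numerically. The sketch below uses illustrative values of $c$ and $\epsilon$; the function names are not from the paper.

```python
import math

def rate(c, eps):
    """Rate function K(c, eps) = (c/2) [(1 + eps) ln(1 + eps) - eps]."""
    return 0.5 * c * ((1.0 + eps) * math.log1p(eps) - eps)

def chernoff_exponent(c, eps, t):
    """Bracketed exponent in the Chernoff bound: t (c/2)(1 + eps) - (c/2)(e^t - 1)."""
    return 0.5 * c * (t * (1.0 + eps) - (math.exp(t) - 1.0))

c, eps = 3.0, 0.4                       # hypothetical parameter values
t_star = math.log1p(eps)                # claimed maximizer t* = ln(1 + eps)

at_optimum = chernoff_exponent(c, eps, t_star)
nearby = [chernoff_exponent(c, eps, t_star + d) for d in (-0.1, -0.01, 0.01, 0.1)]

# Small-eps behaviour: K(c, eps) / eps^2 approaches c / 4.
ratio = rate(c, 1e-3) / 1e-3 ** 2
```

At `t_star` the exponent coincides with `rate(c, eps)` and is larger than at the perturbed values in `nearby`, as expected since the exponent is concave in `t`.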
D.7 Necessary Conditions for Recovery with Distortion

We now provide the proof for Corollary 7.

The proof of Corollary 7 follows from the following generalization of the conditional Fano's inequality presented in Lemma 22. This is a modified version of an analogous theorem in (Kim et al., 2008).



Lemma 24 (Conditional Fano's Inequality (Generalization)) In the above notation, 
we have 



-ip) 



H{G\X-,Ge% )-l-log,L ^ p^^^^^ g^^„^^ ^ ^1^ ^ ^(p)^ 



(p) 



iog2(|7;' 

where L = (X)Hh{P) o^f^d P is defined in ([26 



(93) 



We will only provide a proof sketch of Lemma 24 since it is similar to Lemma 22.

Proof: The key to establishing (93) is to upper bound the cardinality of the set $\{G' \in \mathfrak{G}_p : d(G,G') \le D\}$, which is isomorphic to $\{E' \in \mathfrak{E}_p : |E \triangle E'| \le D\}$, where $\mathfrak{E}_p$ is the set of all edge sets (with $p$ nodes). For this purpose, we order the node pairs in a labelled undirected graph lexicographically. Now, we map each edge set $E$ into a length-$\binom{p}{2}$ bit-string $s(E) \in \{0,1\}^{\binom{p}{2}}$. The characters in the string $s(E)$ indicate whether or not an edge is present between two node pairs. Define $d_H(s,s')$ to be the Hamming distance between strings $s$ and $s'$. Then, note that

$$|E \triangle E'| = d_H(s(E), s(E')) = d_H(s(E) \oplus s(E'), 0) \qquad (94)$$

where $\oplus$ denotes addition in $\mathbb{F}_2$ and $0$ denotes the all-zeros string. The relation in (94) means that the cardinality of the set $\{E' \in \mathfrak{E}_p : |E \triangle E'| \le D\}$ is equal to the number of strings of Hamming weight less than or equal to $D$. With this realization, it is easy to see that

$$\left|\left\{s \in \{0,1\}^{\binom{p}{2}} : d_H(s,0) \le D\right\}\right| = \sum_{k=0}^{D}\binom{\binom{p}{2}}{k} \le 2^{\binom{p}{2}H_b(\beta)}.$$

By using the same steps as in the proof of Lemma 22 (or Fano's inequality for list decoding), we arrive at the desired conclusion. □
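The counting step can be checked directly for small graphs. The sketch below assumes the standard entropy bound on the volume of a Hamming ball, $\sum_{k \le D}\binom{N}{k} \le 2^{N H_b(D/N)}$ for $D \le N/2$, with $N = \binom{p}{2}$; identifying $D/N$ with the distortion parameter of the lemma is an assumption made here for illustration.

```python
from math import comb, log2

def binary_entropy(x):
    """Binary entropy H_b(x) in bits, with H_b(0) = H_b(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1.0 - x) * log2(1.0 - x)

def hamming_ball_size(N, D):
    """Number of length-N bit strings with Hamming weight at most D."""
    return sum(comb(N, k) for k in range(D + 1))

p = 6                      # hypothetical number of nodes
N = comb(p, 2)             # number of node pairs, i.e. length of s(E)

# Compare log2 of the exact count against N * H_b(D / N) for D <= N / 2.
checks = [(D, log2(hamming_ball_size(N, D)), N * binary_entropy(D / N))
          for D in range(N // 2 + 1)]
```

Each tuple in `checks` holds the distortion `D`, the exact log-count of edge sets within distance `D` of a fixed edge set, and the entropy bound on that log-count.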

References 

P. Abbeel, D. Koller, and A.Y. Ng. Learning factor graphs in polynomial time and sample 
complexity. The Journal of Machine Learning Research, 7:1743-1788, 2006. 

A. Anandkumar, V. Y. F. Tan, and A. S. Willsky. High-Dimensional Structure Learning of
Ising Models on Sparse Random Graphs. Preprint. Available on arXiv:1011.0129, Nov.
2010.

A. Anandkumar, A. Hassidim, and J. Kelner. Topology Discovery of Sparse Random Graphs 
With Few Participants. arXiv: 1102.5063, Feb. 2011a. 

A. Anandkumar, V. Y. F. Tan, and A. S. Willsky. High-Dimensional Structure Learning of 
Ising Models: Tractable Graph Families. Preprint, Available on ArXiv 1107.1736, June 
2011b. 

M. Bayati, D. Shah, and M. Sharma. Maximum Weight Matching via Max-Product Belief 
Propagation. In Proc. IEEE Intl. Symposium on Information Theory (ISIT), 2005. 

M. Bayati, A. Braunstein, and R. Zecchina. A rigorous analysis of the cavity equations for 
the minimum spanning tree. Journal of Mathematical Physics, 49:125206, 2008a. 

M. Bayati, D. Shah, and M. Sharma. Max-product for maximum weight matching:
Convergence, correctness, and LP duality. Information Theory, IEEE Transactions on, 54(3):
1241-1251, 2008b.

P.J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statis- 
tics, 36(6):2577-2604, 2008. 

A. Bogdanov, E. Mossel, and S. Vadhan. The Complexity of Distinguishing Markov Random 
Fields. Approximation, Randomization and Combinatorial Optimization. Algorithms and 
Techniques, pages 331-342, 2008. 

B. Bollobás. Random Graphs. Academic Press, 1985.

G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov Random Fields from Sam- 
ples: Some Observations and Algorithms. In Intl. workshop APPROX Approximation, 
Randomization and Combinatorial Optimization, pages 343-356. Springer, 2008. 

V. Chandrasekaran, J.K. Johnson, and A.S. Willsky. Estimation in Gaussian graphical 
models using tractable subgraphs: A walk-sum analysis. Signal Processing, IEEE Trans- 
actions on, 56(5):1916-1930, 2008. 




J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu. Learning Bayesian networks from data:
an information-theory based approach. Artificial Intelligence, 137(1-2):43-90, 2002.

M.J. Choi, V.Y.F. Tan, A. Anandkumar, and A. Willsky. Learning Latent Tree Graphical 
Models. J. of Machine Learning Research, 12:1771-1812, May 2011. 

C. Chow and C. Liu. Approximating Discrete Probability Distributions with Dependence 
Trees. IEEE Tran. on Information Theory, 14(3):462-467, 1968. 

F.R.K. Chung. Spectral Graph Theory. Amer. Mathematical Society, 1997.

F.R.K. Chung and L. Lu. Complex Graphs and Networks. Amer. Mathematical Society, 2006.

T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 2006. 

A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance
selection. SIAM J. Matrix Anal. & Appl., 30(56), 2008.

A. Dembo and A. Montanari. Ising Models on Locally Tree-like Graphs. Annals of Applied 
Probability, 2010. 

S. Dommers, C. Giardina, and R. van der Hofstad. Ising models on power-law random 
graphs. Journal of Statistical Physics, pages 1-23, 2010. 

I. Dumitriu and S. Pal. Sparse regular random graphs: spectral density and eigenvectors.
arXiv preprint arXiv:0910.5306, 2009.

D. Gamarnik, D. Shah, and Y. Wei. Belief propagation for min-cost network flow:
convergence & correctness. In Proc. of ACM-SIAM Symposium on Discrete Algorithms, pages
279-292, 2010.

A. Gamburd, S. Hoory, M. Shahshahani, A. Shalev, and B. Virag. On the girth of random
Cayley graphs. Random Structures & Algorithms, 35(1):100-117, 2009.

J.Z. Huang, N. Liu, M. Pourahmadi, and L. Liu. Covariance matrix selection and estimation 
via penalised normal likelihood. Biometrika, 93(1), 2006. 

M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with
the PC-algorithm. J. of Machine Learning Research, 8:613-636, 2007.

D. Karger and N. Srebro. Learning Markov Networks: Maximum Bounded Tree-width 
Graphs. In Proc. of ACM-SIAM symposium on Discrete algorithms, pages 392-401, 
2001. 

Y.-H. Kim, A. Sutivong, and T. M. Cover. State Amplification. IEEE Transactions on
Information Theory, 54(5):1850-1859, May 2008.

M. Krivelevich and B. Sudakov. The largest eigenvalue of sparse random graphs. Combi- 
natorics, Probability and Computing, 12(01):61-72, 2003. 

C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix 
estimation. Annals of statistics, 37(6B):4254, 2009. 




S.L. Lauritzen. Graphical Models. Clarendon Press, 1996.

H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest density estimation.
J. of Machine Learning Research, 12:907-951, 2011.

Y. Liu, V. Chandrasekaran, A. Anandkumar, and A. Willsky. Feedback Message Passing 
for Inference in Gaussian Graphical Models. In Proc. of IEEE ISIT, Austin, USA, June 
2010. 

L. Lovász, V. Neumann-Lara, and M. Plummer. Mengerian theorems for paths of bounded
length. Periodica Mathematica Hungarica, 9(4):269-276, 1978.

D.M. Malioutov, J.K. Johnson, and A.S. Willsky. Walk-Sums and Belief Propagation in 
Gaussian Graphical Models. J. of Machine Learning Research, 7:2031-2064, 2006. 

R.J. McEliece, D.J.C. MacKay, and J.F. Cheng. Turbo decoding as an instance of Pearl's 
belief propagation algorithm. Selected Areas in Communications, IEEE Journal on, 16 
(2):140-152, 2002. ISSN 0733-8716. 

B.D. McKay, N.C. Wormald, and B. Wysocka. Short cycles in random regular graphs. The 
Electronic Journal of Combinatorics, 11(R66):1, 2004. 

N. Meinshausen and P. Bühlmann. High Dimensional Graphs and Variable Selection With
the Lasso. Annals of Statistics, 34(3):1436-1462, 2006.

C.C. Moallemi and B. Van Roy. Convergence of min-sum message-passing for convex opti- 
mization. Information Theory, IEEE Transactions on, 56(4):2041-2050, 2010. 

J.M. Mooij and H.J. Kappen. Sufficient Conditions for Convergence of the Sum-Product
Algorithm. Information Theory, IEEE Transactions on, 53(12):4422-4437, 2007. ISSN
0018-9448.

K. Murphy, Y. Weiss, and M.I. Jordan. Loopy belief propagation for approximate inference: 
An empirical study. In Proc. of Uncertainty in AI, pages 467-475, 1999. 

P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai. Greedy Learning of Markov
Network Structure. In Proc. of Allerton Conf. on Communication, Control and Computing,
Monticello, USA, Sept. 2010.

J. Pearl. Probabilistic Reasoning in Intelligent Systems — Networks of Plausible Inference. 
Morgan Kaufmann, 1988. 

P. Ravikumar, M.J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance
estimation by minimizing ℓ1-penalized log-determinant divergence. arXiv preprint
arXiv:0811.3628, 2008.

A.J. Rothman, P.J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance 
estimation. Electronic Journal of Statistics, 2:494-515, 2008. 

H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. Chapman 
and Hall, London, 2005. 




N. Ruozzi and S. Tatikonda. Convergent and correct message passing schemes for 
optimization problems over graphical models. arXiv preprint arXiv:1002.3239, 2010. 

N. Ruozzi, J. Thaler, and S. Tatikonda. Graph covers and quadratic minimization. In 
Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton 
Conference on, pages 1590-1596, 2009. 

S. Sanghavi, D. Shah, and A.S. Willsky. Message passing for maximum weight independent 
set. Information Theory, IEEE Transactions on, 55(11):4822-4834, 2009. ISSN 0018-9448. 

N.P. Santhanam and M.J. Wainwright. Information-theoretic Limits of High-dimensional 
Model Selection. In International Symposium on Information Theory, Toronto, Canada, 
July 2008. 

P. Spirtes and C. Meek. Learning Bayesian networks with discrete variables from data. In 
Proc. of Intl. Conf. on Knowledge Discovery and Data Mining, pages 294-299, 1995. 

V.Y.F. Tan, A. Anandkumar, L. Tong, and A. Willsky. A Large-Deviation Analysis for the 
Maximum Likelihood Learning of Tree Structures. IEEE Tran. on Information Theory, 
March . 

V.Y.F. Tan, A. Anandkumar, and A. Willsky. Learning Gaussian Tree Models: Analysis 
of Error Exponents and Extremal Structures. IEEE Tran. on Signal Processing, 58(5): 
2701-2714, May 2010. 

V.Y.F. Tan, A. Anandkumar, and A. Willsky. Learning Markov Forest Models: Analysis 
of Error Rates. J. of Machine Learning Research, 12:1617-1653, May 2011. 

V.V. Vazirani. Approximation Algorithms. Springer, 2001. 

Pascal O. Vontobel. Counting in graph covers: A combinatorial characterization of the 
Bethe entropy function. arXiv:1012.0065, 2010. 

W. Wang, M.J. Wainwright, and K. Ramchandran. Information-theoretic bounds on model 
selection for Gaussian Markov random fields. In IEEE International Symposium on 
Information Theory Proceedings (ISIT), Austin, TX, June 2010. 

D.J. Watts and S.H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393 
(6684):440-442, 1998. 

Y. Weiss. Correctness of Local Probability Propagation in Graphical Models with Loops. 
Neural Computation, 12(1):1-41, 2000. 

Y. Weiss and W.T. Freeman. Correctness of Belief Propagation in Gaussian Graphical 
Models of Arbitrary Topology. Neural Computation, 13(10):2173-2200, 2001. 

X. Xie and Z. Geng. A recursive method for structural learning of directed acyclic graphs. 
J. of Machine Learning Research, 9:459-483, 2008. 






